REPORT  DOCUMENTATION  PAGE 


Form Approved
OMB No. 0704-0188




AD-A217  577 




6a.  NAME  OF  PERFORMING  ORGANIZATION 


University  of  Minnesota 


6b.  OFFICE  SYMBOL 
(if  applicable ) 


1b. RESTRICTIVE MARKINGS


3. DISTRIBUTION/AVAILABILITY OF REPORT

Unlimited 


5.  MONITORING  ORGANIZATION  REPORT  NUMBER(S) 

AFOSR-TR-90-0006


7a.  NAME  OF  MONITORING  ORGANIZATION 

AFOSR 


6c. ADDRESS (City, State, and ZIP Code)
4-192  EE/CSci 
200  Union  Street  SE 
Minneapolis,  MN  55455 


7b. ADDRESS (City, State, and ZIP Code)

Bolling  AFB,  Washington  DC  20332-6448 


8a.  NAME  OF  FUNDING /SPONSORING 
ORGANIZATION 


8b.  OFFICE  SYMBOL 

(If  applicable) 




9  PROCUREMENT  INSTRUMENT  IDENTIFICATION  NUMBER 

AFOSR-87-0168


8c. ADDRESS (City, State, and ZIP Code)



10  SOURCE  OF  FUNDING  NUMBERS 






PROGRAM 
ELEMENT  NO 


PROJECT 

NO 



11. TITLE (Include Security Classification)


Final  Report  -  Structure  From  Motion 


12.  PERSONAL  AUTHOR(S) 


13a. TYPE OF REPORT

13b.  TIME  COVERED 

14. DATE OF REPORT (Year, Month, Day)

Final 

FROM  4/1/88  TOO/ 30/88 

1988  November  17 

16.  SUPPLEMENTARY  NOTATION 


17. COSATI CODES

FIELD 

GROUP 

SUB-GROUP 

18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)
Image Understanding, Time-Varying Image Analysis,
Visual Motion, Optical Flow, Segmentation

Analysis of surface boundaries has been extended to situations in which a camera is able to actively
track  environmental  surface  points.  Two  problems  were  examined  —  the  determination  of  relative 
depth  at  a  boundary  and  the  determination  of  the  direction  of  motion.  In  both  cases,  the  ability 
to  actively  track  significantly  decreases  the  complexity  of  the  computations  required.  An  analysis 
of  the  computational  basis  for  the  visual  detection  of  moving  objects  has  been  completed.  We  have 
shown  that  moving  object  detection  can  exploit  one  or  more  of  three  general  approaches.  Each 
approach  has  particular  strengths  and  weaknesses.  Two  significant  results  have  been  obtained  in 
the  area  of  motion-based  segmentation.  The  first  combines  motion  and  contrast  information  in 
a  boundary  detection  method  that  is  both  more  reliable  and  more  accurate  than  possible  using 
only  motion  or  only  contrast.  The  integration  is  done  in  a  manner  involving  little  additional 
computation.  Secondly,  we  have  shown  how  motion  information  can  be  used  to  reduce  ambiguity 
in the recognition of partially occluded objects.


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
[X] UNCLASSIFIED/UNLIMITED [ ] SAME AS RPT [ ] DTIC USERS


21.  ABSTRACT  SECURITY  CLASSIFICATION 
Unclassified 


22a. NAME OF RESPONSIBLE INDIVIDUAL
Dr.  Abraham  Waksman 


22b.  TELEPHONE  (Include  Area  Code) 
(202) 767-5027


22c.  OFFICE  SYMBOL 



DD Form 1473, JUN 86


Previous  editions  are  obsolete 


SECURITY CLASSIFICATION OF THIS PAGE




FINAL REPORT - STRUCTURE FROM MOTION


AFOSR  Contract  AFOSR-87-0168 


a.  Objectives. 




Our  principal  objective  continues  to  be  the  development  of  robust  computational  approaches  for 
estimating the spatial organization of a scene using time-varying properties of image sequences.
Three  closely  related  problems  are  being  pursued: 


•  Active  tracking  of  surface  boundaries. 

Much  attention  is  currently  being  paid  to  problems  involving  active  vision.  An  active  vision 
system  is  able  to  at  least  partially  control  the  manner  in  which  perceptual  information  is 
acquired.  Within  the  context  of  motion,  several  authors  have  argued  that  active  tracking  of 
moving  objects  or  surface  points  provides  additional  constraints  of  use  in  solving  structure- 
from-motion  problems.  We  have  shown  that  this  is  not  in  fact  true.  Active  tracking  can, 
however,  significantly  simplify  some  of  the  computations  involved  in  analyzing  visual  motion. 

•  Moving  object  detection. 

The  detection  of  moving  objects  is  an  important  task  for  many  robotics  applications.  With 
previous  AFOSR  support,  we  developed  a  series  of  algorithms  for  moving  object  detection  in 
a  variety  of  special  situations.  Under  this  contract,  we  have  placed  these  methods  under  a 
coherent  theoretical  framework.  As  a  result,  it  is  now  much  easier  to  determine  the  difficulty 
of  detection  for  a  given  situation  and  to  apply  the  most  appropriate  detection  method. 

•  Motion-based  segmentation. 

We  have  done  extensive  research  on  methods  for  incorporating  motion  into  the  segmentation 
process. Motion-based segmentation is important because it provides more information than
methods  using  only  static  cues.  Two  significant  accomplishments  have  been  achieved  under 
this  contract: 

-  Integrating  motion  and  contrast  for  segmentation. 

Motion-based  edge  detection  is  sensitive  only  to  actual  surface  boundaries.  As  a  result, 
ambiguity  is  reduced  over  methods  based  only  on  image  contrast.  Traditional  brightness- 
based  edge  detection  is  far  more  precise  at  localizing  edges,  however.  We  have  shown 
how  edge  detectors  can  be  built  that  naturally  incorporate  the  best  aspects  of  brightness- 
based  and  motion-based  edge  detection. 

-  Occlusion-sensitive  matching. 

Our most important result under the current contract deals with improvements in object
recognition that are possible using the results of our motion-based segmentation




technique.  Recognition  in  the  presence  of  occlusion  is  difficult  because  it  is  hard  to  tell 
what  features  are  part  of  the  object  being  analyzed  and  what  features  are  actually  part 
of  other  objects  partially  occluding  the  object  of  interest.  Our  approach  uses  motion  to 
differentiate  between  occluding  and  occluded  surfaces,  and  then  uses  this  information  to 
remove  irrelevant  features  from  the  classification  process. 


b.  Status  of  research  effort. 


Active  tracking  of  surface  boundaries. 

Others have argued that optical tracking of an environmental surface point significantly decreases
the intrinsic complexity of various structure-from-motion problems. This is not in fact true. Tracking
provides neither additional constraints nor other sorts of new information. This is easily seen
by recognizing that all of the information in the tracking image is available in an image of the same
scene without tracking. Tracking is accomplished by generating a rotation of the eye/camera system
based on estimates of image drift, such as optical flow at the image center. Once this rotational
velocity is determined, a non-tracking image sequence can trivially be converted into the equivalent
tracking sequence using standard techniques.

Active  tracking  can  lead  to  important  efficiencies  in  the  implementation  of  certain  structure-from- 
motion  algorithms.  We  have  developed  two  such  methods: 

•  Identification  of  occluding  surface. 

When a boundary element is visually tracked, the region to the side of the boundary corresponding
to the occluding surface will have near-zero image flow. The region to the side of the
boundary corresponding to the occluded surface will in general be associated with significant
visual motion.

•  Determination  of  direction  of  observer  motion. 

When  a  boundary  element  is  visually  tracked,  optical  flow  due  to  the  more  distant  surface 
indicates  the  direction  of  observer  motion.  The  flow  vectors  point  in  the  direction  of  the 
image  location  corresponding  to  the  line  of  sight  coincident  with  the  direction  of  translational 
motion.  Multiple  fixations  over  the  field  of  view  can  be  used  to  solve  for  the  actual  direction 
of  translation. 

The first of these techniques requires only the detection of regions with significant image motion, a
far easier task than the comparisons required by previously known methods. The second technique
eliminates difficulties due to camera rotation that plague most other solutions to this problem.
Additional discussion is presented in [5].
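The occluding-surface test in the first technique can be sketched in a few lines. This is a minimal illustration, assuming optical flow has already been estimated in two small regions on either side of the tracked boundary element; the names `mean_speed` and `occluding_side` and the plain comparison of mean flow magnitudes are our own illustrative choices, not the method of [5].

```python
import math


def mean_speed(flows):
    """Mean magnitude of a list of (u, v) flow vectors."""
    return sum(math.hypot(u, v) for u, v in flows) / len(flows)


def occluding_side(flows_left, flows_right):
    """Return 'left' or 'right': the side of a tracked boundary element with the
    smaller image flow, taken to be the occluding surface.

    Hypothetical helper: assumes the boundary element is actively tracked, so the
    occluding surface shows near-zero flow and the occluded surface does not.
    """
    if mean_speed(flows_left) < mean_speed(flows_right):
        return "left"
    return "right"


# Usage: near-zero flow on the left, significant flow on the right.
print(occluding_side([(0.01, 0.0), (0.0, -0.02)], [(0.8, 0.1), (0.7, 0.0)]))  # left
```

A practical version would also require the slower side to be close to zero and the faster side to exceed a noise threshold before reporting a decision; in this sketch a tie simply defaults to "right".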




Moving  object  detection. 

The reliable detection of moving objects is essential for many robotics applications. If the camera is
stationary and illumination constant, this can be done by simple techniques which compare successive
image frames, looking for significant differences. If the camera is moving, however, the problem
is  considerably  more  difficult.  For  a  moving  camera,  both  moving  objects  and  stationary  portions  of 
the  scene  may  be  changing  position  with  respect  to  the  camera  and  thus  generating  visual  motion 
in  the  imagery.  A  moving  camera  leads  to  difficulties  because  of  the  need  to  determine  objects 
moving  with  respect  to  the  environment,  rather  than  the  much  easier  problem  of  finding  objects 
moving  with  respect  to  the  camera.  General  solutions  based  only  on  vision  are  computationally 
complex  and  likely  to  be  numerically  unstable.  If  partial  information  is  available  about  camera 
motion  and/or  scene  structure,  however,  robust  motion  detection  methods  are  possible. 
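For the stationary-camera case, the simple frame-comparison techniques mentioned above can be as basic as thresholded frame differencing. The sketch below assumes grayscale frames stored as equal-sized lists of rows; the function name `changed_pixels` and the threshold value are hypothetical. As the paragraph notes, the test breaks down once the camera moves, because background pixels then change as well.

```python
def changed_pixels(frame_a, frame_b, threshold):
    """Return (row, col) positions whose brightness changes by more than threshold.

    Assumes a stationary camera and constant illumination, so any significant
    temporal difference is attributed to a moving object.
    """
    return [
        (i, j)
        for i, (row_a, row_b) in enumerate(zip(frame_a, frame_b))
        for j, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > threshold
    ]


previous = [[10, 10, 10], [10, 10, 10]]
current = [[10, 10, 10], [10, 50, 10]]
print(changed_pixels(previous, current, threshold=20))  # [(1, 1)]
```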

We  have  shown  that  possible  approaches  to  this  problem  fall  into  three  categories: 

• Violations of motion epipolar constraint.

Translational  motion  produces  a  flow  field  radially  expanding  from  a  “focus  of  expansion" 
(FOE).  Any  flow  vectors  violating  this  constraint  are  due  to  moving  objects. 

•  Comparison  of  optical  flow  and  other  depth  information. 

While patterns of optical flow do not uniquely specify depth, they do constrain the possible values.
Motion-based  constraints  on  possible  depth  can  be  combined  with  static  constraints  obtained 
from  cues  such  as  stereo.  Violations  of  the  combined  constraints  indicate  that  moving  objects 
are  present. 

•  Violations  of  rigid  object  constraint. 

Only certain patterns of optical flow can correspond to the imagery produced by a moving,
rigid, three-dimensional object. While we have not yet researched this approach extensively,
there is reason to believe that it may be possible to determine whether or not this rigidity
constraint is actually satisfied. If so, distinct, non-rigid motion corresponds to moving
objects.

Understanding  the  theoretical  underpinnings  of  moving  object  detection  has  several  advantages. 
Perhaps  most  importantly,  it  is  now  possible  to  determine  under  what  situations  a  particular 
approach  will  work  without  having  to  examine  the  details  of  a  specific  algorithm.  Likewise,  the 
strengths  and  weaknesses  of  whole  classes  of  algorithms  can  be  investigated  at  one  time.  Finally, 
we  expect  that  better  performing  algorithms  will  arise  from  a  more  complete  understanding  of  the 
basic  constraints  involved  in  the  problem.  More  information  can  be  found  in  [1]. 

Motion-based  segmentation. 

Edge detection algorithms based on visual motion perform significantly differently from those based
on brightness. Previous attempts to combine motion and contrast information in edge detection
have not recognized these differences. Static cues such as contrast edges give good spatial localization,
but are subject to highly ambiguous interpretations. Visual motion is a robust indicator of
surface boundaries, but does not yield precise information on the location of the boundary. The
approach described in [4] accurately locates edges due to surface boundaries, without generating
many "false" edges. Furthermore, the combined method adds minimal computational complexity
to the edge detection process.

Our most important result under the current contract deals with the problem of recognizing partially
occluded  objects.  Most  existing  matching  algorithms  that  are  tolerant  of  occlusion  look  for  a  partial 
correspondence  between  model  and  image  features.  If  a  partial  match  is  found,  unmatched  model 
components  are  assumed  to  be  hidden  by  an  occlusion.  This  approach  leads  to  difficulties  because  of 
the  chances  for  partial  matches  occurring  coincidentally.  In  our  method,  motion-based  information 
about  occlusion  boundaries  is  used  to  explicitly  identify  model  features  that  will  not  be  visible  in 
the  image.  Most  of  the  remaining  model  features  should  be  findable  if  the  match  is  in  fact  correct. 
Occluded  model  features  are  determined  based  directly  on  image  properties  at  boundaries,  rather 
than  just  on  the  absence  of  an  image  feature  at  some  expected  location.  The  result  is  a  significant 
decrease in ambiguity. Details are found in [4].
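The difference between inferring occlusion from missing features and predicting it from boundary information can be illustrated with a toy scoring function. This is a hypothetical sketch, not the matcher of [4]: features are 2-D points, `predicted_occluded` is assumed to come from motion-based analysis of occlusion boundaries (not shown), and `tol` is an arbitrary matching tolerance. Only model features not predicted to be hidden are scored, so an unmatched visible feature counts against the match instead of being excused as occluded.

```python
import math


def match_score(model_features, image_features, predicted_occluded, tol):
    """Fraction of model features NOT predicted to be occluded that have an
    image feature within distance tol. Features are (x, y) tuples;
    predicted_occluded is a set of indices into model_features."""
    visible = [m for k, m in enumerate(model_features) if k not in predicted_occluded]
    if not visible:
        return 0.0

    def found(m):
        return any(math.hypot(m[0] - f[0], m[1] - f[1]) <= tol for f in image_features)

    return sum(found(m) for m in visible) / len(visible)


model = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
image = [(0.05, 0.0), (1.0, 0.02)]
print(match_score(model, image, predicted_occluded={2, 3}, tol=0.1))  # 1.0
print(match_score(model, image, predicted_occluded={0, 1}, tol=0.1))  # 0.0
```

With the occluded set supplied by boundary analysis, a hypothesis that the image actually supports scores 1.0, while one that would require the observed features themselves to be hidden scores 0.0.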


c.  Publications. 


[1] W.B. Thompson and T.C. Pong, “Detecting Moving Objects,” submitted to International Journal
of Computer Vision.

[2] W.B. Thompson and T.C. Pong, “Detecting Moving Objects,” Proceedings of the First International
Conference on Computer Vision, June 1987.

[3] W.B. Thompson, L.G. Craton, and A. Yonas, “The 2½-D Sketch,” Proceedings of the AAAI
Workshop  on  Physical  and  Biological  Approaches  to  Computational  Vision,  March  1988. 

[4]  W.B.  Thompson  and  R.P.  Whillock,  “Occlusion-Sensitive  Matching,”  Proceedings  of  the  Second 
International  Conference  on  Computer  Vision,  December  1988. 

[5] W.B. Thompson, “Structure-From-Motion By Tracking Occlusion Boundaries,” Proceedings of the
IEEE Workshop on Visual Motion, March 1989.


d.  Scientific  Collaborators. 

Research Assistants:

King Chu
Martin Kenner
Steven Savitt
Elizabeth Stuck
Rand Whillock

Collaborating Faculty:

Herbert Pick
Ting-Chuen Pong
Albert Yonas




Detecting  Moving  Objects 


William  B.  Thompson 
Ting-Chuen  Pong 

Computer  Science  Department 
University  of  Minnesota 
Minneapolis, MN 55455


Submitted  to  the  International  Journal  of  Computer  Vision. 

Abstract 

The  detection  of  moving  objects  is  important  in  many  tasks.  This  paper  examines 
moving  object  detection  based  primarily  on  optical  flow.  We  conclude  that  in  realistic 
situations,  detection  using  visual  information  alone  is  quite  difficult,  particularly  when 
the camera may also be moving. The availability of additional information about camera
motion  and/or  scene  structure  greatly  simplifies  the  problem.  Two  general  classes  of 
techniques  are  examined.  The  first  is  based  around  the  motion  epipolar  constraint  - 
translational  motion  produces  a  flow  field  radially  expanding  from  a  “focus  of  expansion” 
(FOE).  The  second  class  of  methods  is  based  on  comparing  observed  optical  flow  with 
other  information  about  depth.  Examples  of  several  of  these  techniques  are  presented. 


1  Introduction. 


One  important  function  of  a  vision  system  is  to  recognize  the  presence  of  moving  objects  in  a  scene. 
If  the  camera  is  stationary  and  illumination  constant,  this  can  be  done  by  simple  techniques  which 
compare  successive  image  frames,  looking  for  significant  differences.  If  the  camera  is  moving,  the 
problem  is  considerably  more  complex.  For  the  purposes  of  this  discussion,  moving  objects  are 
taken  to  be  any  objects  moving  with  respect  to  the  stationary  portions  of  the  scene,  which  we  refer 
to  as  the  environment.  For  a  moving  camera,  both  moving  objects  and  stationary  portions  of  the 
scene  may  be  changing  position  with  respect  to  the  camera  and  thus  generating  visual  motion  in 
the imagery. A moving camera leads to difficulties because of the need to determine objects moving
with respect to the environment, rather than the much easier problem of finding objects moving
with respect to the camera. In this paper, we deal with the problem of detecting moving objects
from a moving camera based on optical flow.

This work was supported by AFOSR contract AFOSR-87-0168 and NSF Grants DCR-8500899 and IRI-8722576.

A preliminary version of this paper appeared in The Proceedings of the First International Conference on Computer
Vision, London, June 1987.


The  visual  detection  of  moving  objects  is  a  surprisingly  difficult  task.  A  simple  example 
illustrates  just  how  serious  the  problem  can  be.  Consider  the  optical  flow  field  shown  in  figure  1 
which  appears  to  show  a  small,  square  region  in  the  center  of  the  image  moving  to  the  right  and 
surrounded  by  an  apparently  stationary  background.  Such  a  flow  field  can  arise  from  several  equally 
plausible  situations:  1)  The  camera  is  stationary  with  respect  to  the  environment,  and  the  central 
region  corresponds  to  an  object  moving  to  the  right.  2)  The  camera  is  moving  to  the  left  with 
respect  to  the  environment,  most  of  the  environment  is  sufficiently  distant  so  that  the  generated 
optical  flow  is  effectively  zero,  while  the  central  region  corresponds  to  a  surface  near  to  the  camera 
but  stationary  with  respect  to  the  environment.  3)  The  camera  and  object  are  moving  with  respect 
to  both  the  environment  and  each  other,  though  the  environment  is  sufficiently  distant  so  that 
there  is  no  perceived  optical  flow.  It  is  not  possible  to  tell  whether  or  not  this  seemingly  simple 
pattern corresponds to a moving object!¹
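The first two situations can be checked numerically with the translational term of the optical flow equations given in Section 5 (no rotation). The speed, ranges, and image points below are illustrative values, not taken from figure 1. T is the camera translation relative to the imaged point, so an object moving right past a stationary camera and a camera moving left past a static surface both give T = (-s, 0, 0) for the central region.

```python
def translational_flow(x, y, r, T):
    """Flow (-U + xW, -V + yW) / r at normalized image point (x, y) for a point at
    range r, with T = (U, V, W) the camera translation relative to that point."""
    U, V, W = T
    return ((-U + x * W) / r, (-V + y * W) / r)


s, near, far = 1.0, 2.0, 1e9  # illustrative speed and ranges
center = (0.0, 0.0)           # image point inside the central square
surround = (0.8, 0.5)         # image point in the surrounding region

# Situation 1: stationary camera; the central object moves right at speed s.
obj_1 = translational_flow(*center, near, (-s, 0.0, 0.0))
bg_1 = translational_flow(*surround, near, (0.0, 0.0, 0.0))

# Situation 2: camera moves left at speed s; the central surface is static and near,
# while the rest of the environment is very distant.
obj_2 = translational_flow(*center, near, (-s, 0.0, 0.0))
bg_2 = translational_flow(*surround, far, (-s, 0.0, 0.0))

print(obj_1, obj_2)  # (0.5, 0.0) (0.5, 0.0)
print(bg_1, bg_2)    # (0.0, 0.0) (1e-09, 0.0)
```

The relative motion between camera and central surface is the same in both situations, so the central flow is identical; only the background differs, and with a distant enough environment that difference falls far below any realistic noise level in estimated flow.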

Figure  1  provides  one  example  of  why  a  general  and  reliable  solution  to  the  problem  of  moving 
object  detection  based  only  on  optical  flow  is  not  feasible.  Robust  solutions  require  that  additional 
information  about  camera  motion  and/or  scene  structure  be  available.  In  this  paper,  we  examine  a 
variety  of  types  of  information  that  might  be  available.  Each  information  source  places  constraints 
on  the  optical  flow  fields  that  can  be  generated  by  a  camera  moving  through  an  otherwise  static 
environment.  Violations  of  these  constraints  are  thus  necessarily  due  to  moving  objects. 

1  The  flow  pattern  in  figure  1  provides  little  information  about  actual  camera  motion.  Apparently  stationary  image 
regions  can  be  due  to  the  viewing  of  distant  surfaces  and/or  rotational  motion  that  tracks  a  surface  point,  keeping 
it at a fixed point in the field of view. Even with significant non-zero flow existing over the whole of the image,
ambiguities  exist  between  flow  patterns  due  to  translational  motion  and  due  to  rotational  motion  [1]. 


2  Background. 


An  extensive  literature  has  developed  on  computational  approaches  to  the  analysis  of  visual  motion 
(e.g., see [2]). The majority of this work deals with what Ullman [3] has called the structure-
from-motion  and  motion-from-structure  problems.  Visual  motion  is  used  to  determine  the  three- 
dimensional  position  of  surface  points  under  view  and/or  the  parameters  of  motion  relating  camera 
and  object.  Almost  without  exception,  papers  describing  structure-from-motion  and  motion-from- 
structure  algorithms  deal  only  with  a  single,  rigid  object  in  the  field  of  view.  If  the  problem  of 
separately moving objects is mentioned at all, it is in a comment that the image must be segmented
into  separately  moving  objects  before  the  method  being  described  is  applied. 

Some  work  has  been  done  on  the  segmentation  of  images  based  on  visual  motion.  The  easiest 
form  of  this  problem  occurs  with  a  camera  known  to  be  stationary.  In  such  circumstances,  object 
motion  leads  to  significant  temporal  differences  in  an  image  sequence.  Such  differences  correspond 
to moving objects, and furthermore can be used to estimate the boundaries of the objects (e.g.,
[4, 5]). More classical edge-detection techniques can also be applied to time-varying imagery [6, 7,
8, 9, 10, 11]. Such approaches work for both moving and stationary cameras. When the camera is
moving,  however,  sharp  spatial  changes  in  visual  motion  can  correspond  to  either  the  boundaries 
of  moving  objects  or  to  depth  discontinuities  between  two  rigidly  attached  surfaces.  As  a  result, 
motion-based  edge  detection  is  not  sufficient  to  detect  moving  objects. 

Jain  is  one  of  the  few  researchers  to  deal  directly  with  the  problem  of  detecting  moving  objects 
using  a  moving  camera  [8].  His  approach  exploits  the  motion  epipolar  constraint  which  says  that 
for  translational  camera  motion  with  respect  to  a  static  environment,  optical  flow  will  expand 
radially  from  a  focus  of  expansion  corresponding  to  the  direction  of  translation.  For  translational 
motion,  any  flow  values  violating  the  epipolar  constraint  must  be  due  to  moving  objects  in  the 
scene.  Unfortunately,  this  approach  requires  knowledge  of  the  direction  of  translation  and  does  not 
work  if  the  motion  has  a  rotational  component. 


3  Possible  approaches. 

At  least  three  general  approaches  to  moving  object  detection  are  possible.  Each  exploits  a  particular 
constraint  that  must  hold  if  a  camera  is  moving  through  an  otherwise  static  environment.  Detecting 
moving  objects  becomes  equivalent  to  a  search  for  violated  constraints. 

•  Motion  epipolar  constraint. 

Translational  camera  motion  produces  a  distinctive  optical  flow  pattern.  Flow  vectors  appear 
to  radiate  out  from  a  “focus  of  expansion”  (FOE)  corresponding  to  the  line  of  sight  coincident 
with  the  direction  of  motion.  This  has  the  effect  of  constraining  the  orientation  of  flow  vectors. 
Visual  motion  which  violates  this  orientational  constraint  must  be  due  to  moving  objects. 
Under some circumstances, the motion epipolar constraint may still be used when camera
rotation is added to the translational movement.

•  Depth/flow  constraint. 

The  optical  flow  generated  by  a  surface  point  is  a  function  of  the  relative  motion  between 
camera  and  surface  and  of  the  range  to  the  surface.  If  range  values  are  available,  then 
inconsistencies  between  optical  flow,  range,  and  observer  motion  signal  moving  objects. 

•  Rigidity  constraint. 

A  scene  containing  moving  objects  can  be  thought  of  as  undergoing  non-rigid  motion  with 
respect  to  the  camera.  Structure-from-motion  techniques  which  are  sensitive  to  the  presence 
of  non-rigid  motion  can  thus  be  used  to  detect  moving  objects. 
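The depth/flow constraint can be sketched as a residual between the observed flow and the flow that equations (1)-(3) of Section 5 predict for a static point. This assumes the range r (e.g., from stereo) and the camera motion (T, ω) are known; the function names are hypothetical, and in practice the residual would be compared with a threshold set by the expected noise in the flow estimates.

```python
import math


def predicted_flow(x, y, r, T, omega):
    """Flow of a static point at range r: F = F_t / r + F_r (equations 1-3)."""
    U, V, W = T
    A, B, C = omega
    u = (-U + x * W) / r + A * x * y - B * (x * x + 1) + C * y
    v = (-V + y * W) / r + A * (y * y + 1) - B * x * y - C * x
    return (u, v)


def depth_flow_residual(x, y, flow, r, T, omega):
    """Distance between observed flow and the static-scene prediction.
    A large value signals a violation of the depth/flow constraint."""
    pu, pv = predicted_flow(x, y, r, T, omega)
    return math.hypot(flow[0] - pu, flow[1] - pv)


T, omega = (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)
print(predicted_flow(0.0, 0.0, 2.0, T, omega))                    # (-1.0, 0.0)
print(depth_flow_residual(0.0, 0.0, (-1.0, 0.0), 2.0, T, omega))  # 0.0
```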

This paper will concentrate on epipolar and depth/flow methods. Though potentially effective,
methods based directly on the rigidity constraint require longer frame sequences, temporal
derivatives of optical flow, and/or a wide field of view to enhance perspective effects.


4  Presumptions. 

Many  theoretically  plausible  techniques  for  analyzing  visual  motion  are  ineffective  in  practice. 
Typically,  the  assumptions  on  which  these  techniques  are  either  explicitly  or  implicitly  founded  do 
not  accurately  represent  real  problems.  For  this  work,  we  start  with  the  presumption  that  motion 
detection  algorithms  should  be  designed  with  the  following  properties  in  mind: 

•  The  field  of  view  may  be  relatively  narrow. 

Motion  detection  should  not  depend  on  the  use  of  wide  angle  imaging  systems.  Such  systems 
may not be available in a particular situation, and if used may increase the difficulty of
recognizing small moving objects. As a result, detection algorithms should not depend on
subtle  properties  of  perspective. 

•  The  image  of  moving  objects  may  be  small  with  respect  to  the  field  of  view. 

This is clearly desirable for reliability. Moving objects may be far away and subtend
relatively small visual angles. We need methods capable of identifying single image points, or
at  least  small  collections  of  points,  as  corresponding  to  moving  objects.  Detection  algorithms 
thus  cannot  depend  on  variations  in  flow  over  a  potentially  moving  object. 

•  Estimated  optical  flow  fields  will  be  noisy. 

No  method  is  capable  of  estimating  optical  flow  with  arbitrary  accuracy.  Motion  detection 
based  on  optical  flow  must  be  tolerant  of  noisy  input. 




5  The  Optical  Flow  Equation. 


The  basic  mathematics  governing  the  optical  flow  generated  by  a  moving  camera  is  well  known. 
Our notation is similar to [12], using a coordinate system fixed to the camera (i.e., the world can be
thought of as moving past a stationary camera). Optical flow values are a function of image location,
the  relative  motion  between  the  camera  and  the  surface  point  corresponding  to  the  image  location, 
and  the  distance  from  the  camera  to  the  corresponding  surface  point: 

$$ F(p) = \frac{F_t(p)}{r(p)} + F_r(p) \qquad (1) $$

$$ F_t = (-U + xW,\; -V + yW) \qquad (2) $$

$$ F_r = \left(Axy - B(x^2 + 1) + Cy,\; A(y^2 + 1) - Bxy - Cx\right) \qquad (3) $$

where $F$ is the optical flow at image location $p = (x, y)$, with $x$ and $y$ normalized by the focal length;
$r(p)$ is the range from the camera to the surface point imaged at $p$; $T = (U, V, W)^T$ specifies the
translational velocity of the camera; and $\omega = (A, B, C)^T$ specifies camera rotation.
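Equations (1)-(3) translate directly into code. The sketch below is a plain transcription under the conventions stated above (normalized image coordinates, range r(p) > 0); the function names are our own. The usage lines show the property exploited later in the paper: doubling the range halves the translational contribution but leaves $F_r$ unchanged.

```python
def translational_flow(x, y, T):
    """F_t from equation (2); T = (U, V, W) is the camera translational velocity."""
    U, V, W = T
    return (-U + x * W, -V + y * W)


def rotational_flow(x, y, omega):
    """F_r from equation (3); omega = (A, B, C) is the camera rotation."""
    A, B, C = omega
    return (A * x * y - B * (x * x + 1) + C * y,
            A * (y * y + 1) - B * x * y - C * x)


def optical_flow(x, y, r, T, omega):
    """F(p) from equation (1) at normalized image point (x, y), range r > 0."""
    ft = translational_flow(x, y, T)
    fr = rotational_flow(x, y, omega)
    return (ft[0] / r + fr[0], ft[1] / r + fr[1])


T, omega = (0.0, 0.0, 1.0), (0.0, 0.5, 0.0)
print(optical_flow(0.5, 0.0, 1.0, T, omega))  # (-0.125, 0.0)
print(optical_flow(0.5, 0.0, 2.0, T, omega))  # (-0.375, 0.0)
```

Here $F_t = (0.5, 0)$ and $F_r = (-0.625, 0)$ at the chosen point, so only the first term changes with range.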

Most  work  on  the  analysis  of  optical  flow  has  dealt  with  a  camera  moving  through  an  otherwise 
static  environment  or,  equivalently,  a  single  rigid  object  moving  in  front  of  a  fixed  camera.  In  such 
cases, single values of $T$ and $\omega$ govern the flow over the whole image. If moving objects are present,
then the relative motion between camera and environment will be different from the relative motion
between camera and moving object. Notationally, we will specify the camera motion with respect
to the environment by $T^{(env)}$ and $\omega^{(env)}$. The parameters specifying the relative motion between
the camera and an arbitrary scene point $p$ will be indicated by $T^{(p)}$ and $\omega^{(p)}$. The point $p$ lies on a moving
object if $T^{(p)} \neq T^{(env)}$ and/or $\omega^{(p)} \neq \omega^{(env)}$.


6  Detection  based  on  Epipolar  Constraint. 


If complete information about instantaneous camera motion is available, then $T^{(env)}$ and $\omega^{(env)}$ are
known. If the camera is translating but not rotating with respect to the background, $\omega^{(env)} = 0$,
$F_r = 0$, and all flow vectors due to the moving image of the background will radiate away from a
focus of expansion (FOE). From equations 1 and 2, it is easy to see that the image plane location
of the FOE is at:

$$ (x, y)_{foe} = \left(\frac{U}{W},\; \frac{V}{W}\right) \qquad (4) $$

While  the  location  of  the  FOE  depends  only  on  the  direction  of  translation  and  not  on  the  speed,  it 
is  important  for  detectability  that  the  speed  be  sufficient  to  generate  measurable  optical  flow.  The 
FOE  is  not  restricted  to  lie  within  the  visible  portion  of  the  image  (and  in  fact  may  be  a  focus  of 
contraction). An FOE at ∞ corresponds to pure lateral motion, which generates a parallel optical
flow pattern.
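Equation (4) and the remarks that follow can be captured in a small helper. This is an illustrative sketch (the names are ours) that returns None for W = 0, the FOE-at-∞ case of pure lateral motion, and checks that the translational flow of equation (2) vanishes at the computed point. A negative W yields the same image location, but as a focus of contraction.

```python
def focus_of_expansion(T):
    """Image-plane FOE (U/W, V/W) from equation (4) for translation T = (U, V, W).

    Returns None when W == 0: the FOE is at infinity (pure lateral motion,
    parallel flow). For W < 0 the same point is a focus of contraction.
    """
    U, V, W = T
    if W == 0:
        return None
    return (U / W, V / W)


def translational_flow(x, y, T):
    """F_t from equation (2)."""
    U, V, W = T
    return (-U + x * W, -V + y * W)


T = (1.0, 2.0, 4.0)
foe = focus_of_expansion(T)
print(foe)                                  # (0.25, 0.5)
print(translational_flow(*foe, T))          # (0.0, 0.0)
print(focus_of_expansion((1.0, 0.0, 0.0)))  # None
```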




6.1  Direct  use  of  motion  epipolar  constraint. 


For pure translational motion, the direction of motion specifies the direction of optical flow associated
with any surface point stationary with respect to the environment:


$$ \theta_{fe} = \tan^{-1}\frac{V - Wy}{U - Wx} \qquad (5) $$


where $\theta_{fe}$ is the expected flow orientation at the point $(x, y)$, predicted using the motion epipolar
constraint. Note that this equation is still well defined when $W = 0$, corresponding to a focus
of expansion at ∞ in image coordinates. Any flow values with a significantly different direction
correspond  to  moving  objects  [3].  (The  converse  is  not  necessarily  true.  It  is  possible  that  moving 
objects  coincidentally  generate  flow  values  compatible  with  this  constraint.)  This  approach  requires 
the  estimation  of  only  the  direction  of  flow,  not  either  the  magnitude  or  spatial  variation  of  flow. 
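A minimal sketch of this test, assuming the translation T = (U, V, W) is known and there is no rotation. Orientations are compared modulo π because equation (5) constrains the orientation of a flow vector rather than its sense; the 10° tolerance is an arbitrary illustrative value, not one taken from this paper, and the function names are hypothetical. Points at the FOE, and zero flow vectors, are left unclassified.

```python
import math


def expected_orientation(x, y, T):
    """Orientation theta_fe in [0, pi) predicted by equation (5) for a static point
    at (x, y) under pure translation T = (U, V, W). Returns None at the FOE,
    where the predicted flow vanishes and the orientation is undefined."""
    U, V, W = T
    dx, dy = U - W * x, V - W * y
    if dx == 0 and dy == 0:
        return None
    return math.atan2(dy, dx) % math.pi


def orientation_difference(a, b):
    """Smallest angle between two undirected orientations given in [0, pi)."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def violates_epipolar(x, y, flow, T, tol=math.radians(10.0)):
    """True if the observed flow's orientation departs from theta_fe by more than tol.
    tol is an illustrative threshold, not a value from the paper."""
    theta_fe = expected_orientation(x, y, T)
    u, v = flow
    if theta_fe is None or (u == 0 and v == 0):
        return False  # orientation undefined: the test gives no evidence either way
    theta = math.atan2(v, u) % math.pi
    return orientation_difference(theta, theta_fe) > tol


T = (0.0, 0.0, 1.0)  # forward translation, FOE at the image center
print(violates_epipolar(1.0, 0.0, (0.5, 0.0), T))  # False: radial flow
print(violates_epipolar(1.0, 0.0, (0.0, 0.5), T))  # True: flow across the radial line
```

As the text observes, a False result does not prove that a point is static; a moving object can coincidentally produce flow with the expected orientation.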


Camera rotation introduces considerable complexity. Knowledge of camera motion no longer
constrains the direction of background flow. Nevertheless, at a given point $p$, flow is constrained to
a one-dimensional family of possible vector values. The family is given by (1-3), where $r$ ranges over
all positive values. The analysis can be simplified because of the linear nature of (1). $F_r$ depends
only on the parameters of rotation and not on any shape property of the environment. Because
the value of $F_r$ at a particular point $p$ does not depend on $r(p)$, it can be predicted knowing only
$\omega$. At every point within the field of view, this value can be subtracted from the observed optical
flow field, leaving a translational flow field:

$$ F_{trans} = F - F_r \qquad (6) $$

This  field  behaves  just  as  if  no  rotation  was  occurring,  and  thus  moving  objects  can  be  located  using 
the  FOE  technique  described  above.  For  the  remainder  of  this  paper,  when  rotation  is  present,  we 
will  take  the  term  FOE  to  refer  to  the  focus  of  expansion  of  this  translational  field. 
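The rotation-compensation step can be sketched as follows, assuming both ω and T are known. The function names are hypothetical; the comparison ignores the sense of the vector and uses the sine of the angle between the derotated flow and the direction predicted by equation (2), and the tolerance is again an arbitrary illustrative value.

```python
import math


def rotational_flow(x, y, omega):
    """F_r from equation (3); omega = (A, B, C)."""
    A, B, C = omega
    return (A * x * y - B * (x * x + 1) + C * y,
            A * (y * y + 1) - B * x * y - C * x)


def derotate(x, y, flow, omega):
    """Equation (6): F_trans = F - F_r, using the known camera rotation omega."""
    fr = rotational_flow(x, y, omega)
    return (flow[0] - fr[0], flow[1] - fr[1])


def is_moving(x, y, flow, T, omega, tol=math.radians(10.0)):
    """Derotate the observed flow, then test it against the radial direction
    of F_t from equation (2). Orientation only: the sense of the vector is ignored."""
    u, v = derotate(x, y, flow, omega)
    U, V, W = T
    ex, ey = -U + x * W, -V + y * W
    if (ex == 0 and ey == 0) or (u == 0 and v == 0):
        return False  # at the FOE or zero flow: no evidence either way
    sin_angle = abs(ex * v - ey * u) / (math.hypot(ex, ey) * math.hypot(u, v))
    return sin_angle > math.sin(tol)


T, omega = (0.0, 0.0, 1.0), (0.01, 0.02, 0.0)
x, y, r = 0.5, 0.2, 4.0
fr = rotational_flow(x, y, omega)
background = (x / r + fr[0], y / r + fr[1])  # static point; here F_t = (x, y)
mover = (0.125 + fr[0], -0.3 + fr[1])        # hypothetical independently moving point
print(is_moving(x, y, background, T, omega))  # False
print(is_moving(x, y, mover, T, omega))       # True
```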

In principle, even if camera motion is not known, $T^{(env)}$ and $\omega^{(env)}$ may be estimated from
the imagery (e.g., [12]), subject to a positive, multiplicative scale factor for $T^{(env)}$. Two serious
problems exist, however. Narrow angles of view make estimation of camera motion difficult, as
significantly different parameters of motion and surface shape can yield nearly identical optical
flow patterns [1]. In addition, techniques such as [12] use a global minimization approach which will
not perform well if moving objects make up a substantial portion of the field of view. A clustering
approach (e.g., [13]) can be made tolerant of the moving objects, though great difficulty can be
expected dealing with a five-dimensional cluster space.


6.2  Indirect  use  of  motion  epipolar  constraint. 

The motion epipolar constraint has an important implication for motion analysis methods that operate only over small image neighborhoods. Away from the FOE, Ft(p) and Fr(p) vary slowly with p (equations 2 and 3). Over a small neighborhood, both Ft(p) and Fr(p) are essentially constant. As a result, over a small neighborhood the component of flow due to rotational motion is essentially constant, while the translational flow, F_trans, varies only by a scalar multiple dependent on depth. That is, over the neighborhood F_trans is essentially constant in direction. We can use this result to simplify problems arising from rotational camera motion. In one technique, we explicitly compensate for rotation. In a second technique, active tracking of potentially moving objects leads to a particularly simple computational scheme.


6.2.1  Known  rotation. 

Often,  information  about  camera  rotation  is  available,  even  when  the  direction  of  translation  is 
not known. Non-visual information about camera motion often comes from inertial sources. Such
sources are much more accurate in determining rotation than translation. Rotation involves a
continuous acceleration which is easily measured. The determination of translation requires the
integration  of  accelerations,  along  with  a  starting  boundary  value.  Errors  in  estimated  translation 
values  rapidly  accumulate.  A  simple  technique  allows  the  detection  of  moving  objects  when  only 
camera  rotation  is  known. 

If all motion parameters are known, knowledge of camera rotation makes it possible to compute the translational flow field, F_trans. Knowledge of translation can then be used to locate the FOE and thus constrain the direction of flow vectors associated with the environment. If only rotation is known, it is still possible to determine the translational flow field, but not the FOE. Visual methods can be applied to the translational flow field to estimate the location of the FOE, but these methods suffer from a number of practical limitations when applied to noisy data.

An  alternate  approach  can  be  used  which  does  not  require  the  prior  determination  of  the  FOE. 
The  translational  flow  field  extends  radially  from  the  focus  of  expansion.  From  the  arguments  given 
above,  we  know  that  over  any  local  area  away  from  the  FOE,  variations  in  the  direction  (but  not 
necessarily  magnitude)  of  the  translational  flow  field  will  be  small.  Flow  arising  due  to  moving 
objects  is  of  course  not  subject  to  this  restriction.  The  gradient  of  flow  field  direction  can  thus 
be  used  to  detect  the  boundaries  of  moving  objects.  At  these  boundaries,  flow  direction  will  vary 
discontinuously.²
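Boundary detection from the gradient of flow direction can be sketched on a small grid of translational flow vectors. The text does not say how the direction gradient was computed or thresholded. The grid layout, the neighbour comparison, the 30° threshold and the function names are therefore illustrative assumptions.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)

def direction_edges(field, threshold_deg=30.0, min_mag=1e-6):
    """Mark cells whose flow direction differs sharply from a right or lower neighbour.

    field: rows of (u, v) translational-flow vectors (rotation already removed).
    Returns a same-shaped grid of booleans; magnitude changes alone are ignored.
    """
    rows, cols = len(field), len(field[0])
    thr = math.radians(threshold_deg)
    edges = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            u, v = field[i][j]
            if math.hypot(u, v) < min_mag:
                continue  # direction undefined for (near-)zero flow
            a = math.atan2(v, u)
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni >= rows or nj >= cols:
                    continue
                nu, nv = field[ni][nj]
                if math.hypot(nu, nv) < min_mag:
                    continue
                if angle_diff(a, math.atan2(nv, nu)) > thr:
                    edges[i][j] = edges[ni][nj] = True
    return edges

# Hypothetical 3x4 field: depth changes alter magnitude only; the last column moves differently.
field = [[(1.0, 0.0), (2.0, 0.0), (0.5, 0.0), (0.0, 1.0)] for _ in range(3)]
for row in direction_edges(field):
    print(row)  # [False, False, True, True]
```

The magnitude jumps between the first three columns come from varying depth and are not marked. Only the abrupt change of direction at the last column is flagged.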

A  complementary  technique  is  available  to  deal  with  situations  in  which  translation  but  not 
rotation  is  known.  We  can  expect  these  situations  to  be  rare,  however.  If  the  direction  of  translation 
were  known  over  some  interval  of  time,  it  would  be  an  easy  matter  to  determine  the  rotation  by 
examining  the  rate  of  change  of  direction. 

²Marr [14] claims “if direction of [visual] motion is ever discontinuous at more than one point - along a line, for example - then an object boundary is present.” Note that this is only necessarily true if no camera rotation is occurring (or, equivalently, if camera rotation has been normalized by using the translational flow field).


6.2.2 Active tracking.


A  vision  system  which  can  actively  control  camera  direction  is  capable  of  tracking  regions  of  interest 
over  time,  keeping  some  particular  object  centered  within  the  field  of  view.  Tracking  regions 
of  interest  is  desirable  for  many  reasons  other  than  the  detection  of  moving  objects  (e.g.,  [15]), 
though  the  analysis  of  imagery  arising  from  a  tracking  camera  has  not  received  much  study  by  the 
computer vision community. If there are significant variations in depth over the visible portion of
the  background  and  if  moving  objects  are  relatively  small  with  respect  to  the  field  of  view,  then 
moving  object  detection  based  on  tracking  can  be  accomplished  without  any  actual  knowledge  of 
camera  motion.  (For  motion  detection,  the  tracking  can  easily  be  simulated  if  the  camera  is  not 
actively  controllable.) 


If an object is being tracked, then its optical flow is zero.³ Flow-based methods for determining
whether  or  not  a  tracked  object  is  moving  must  depend  wholly  on  the  patterns  of  flow  in  the 
background.  Object  tracking  helps  in  moving  object  detection  because  it  minimizes  many  of  the 
difficulties  due  to  camera  rotation.  When  dealing  with  instantaneous  flow  fields,  we  can  decompose 
the  problem  by  considering  all  translational  motion  to  be  due  to  movement  of  the  camera  platform 
and  all  rotational  motion  due  to  pan  and  tilt  of  the  camera  to  accomplish  the  tracking.  (We  will 
disregard  any  effects  due  to  spin  around  the  line  of  sight.)  Consider  the  effect  of  tracking  a  point 
that  is  in  fact  part  of  the  environment.  Tracking  is  effected  by  generating  a  rotational  motion  that 
exactly  compensates  for  the  translational  flow  at  the  center  of  the  image.  This  is  accomplished  by 
choosing  Fr  such  that: 

$$F_r(0,0) = -\frac{F_t(0,0)}{r(0,0)} \qquad (7)$$

For  a  small  enough  neighborhood,  Ft  and  Fr  can  be  treated  as  constant,  leading  to  the  following 
flow  equation: 


$$F(p) \approx \left( \frac{1}{r(p)} - \frac{1}{r(0,0)} \right) F_t \qquad (8)$$


The  effect  on  the  optical  flow  field  is  that  in  the  neighborhood  of  the  tracked  point,  the  direction  of 
flow  will  be  approximately  constant  (modulo  180°),  with  a  magnitude  dependent  on  the  difference 
between  the  range  to  the  corresponding  surface  point  and  the  range  to  the  tracked  point. 


Now, consider tracking a point that is moving with respect to the environment. If environmental surface points are visible in the neighborhood of the tracked point, Ft and Fr are no longer constant within the neighborhood. For environmental points:


$$F^{(env)}(p) \approx \frac{F_t^{(env)}}{r(p)} - \frac{F_t^{(object)}}{r(0,0)} \qquad (9)$$


F_t^(env) and F_t^(object) will in general differ in orientation. If there is a variation in range to visible environmental points, then there will be a variation in direction of observed flow over the neighborhood. (Note that detection is not possible if there is no variation in r(p) over the visible environment. This situation is similar to that depicted in figure 1.)
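One way to quantify the "variation in direction of observed flow over the neighborhood" is an axial (modulo 180°) circular standard deviation. This statistic is our assumption; the report does not state which spread measure it used. The sketch generates background flow from equations (8) and (9) using made-up values of Ft, r and r(0,0), then compares the spread for a stationary tracked point with the spread for a moving one.

```python
import math

def tracked_env_flow(ft_env, ft_obj, r, r0):
    """Equation (9): flow at an environmental point at range r while a point at range r0 is tracked.

    With ft_obj == ft_env this reduces to equation (8), i.e. a stationary tracked point.
    """
    return (ft_env[0] / r - ft_obj[0] / r0, ft_env[1] / r - ft_obj[1] / r0)

def axial_spread_deg(vectors, min_mag=1e-9):
    """Circular standard deviation of flow orientation, modulo 180 degrees.

    Angles are doubled so that opposite vectors count as the same orientation,
    the mean resultant length R is computed, and sqrt(-2 ln R) is halved back.
    """
    c = s = 0.0
    n = 0
    for u, v in vectors:
        if math.hypot(u, v) < min_mag:
            continue  # skip (near-)zero vectors, whose direction is undefined
        a = 2.0 * math.atan2(v, u)
        c += math.cos(a)
        s += math.sin(a)
        n += 1
    if n == 0:
        return 0.0
    big_r = min(1.0, math.hypot(c, s) / n)
    if big_r == 0.0:
        return math.inf  # orientations cancel out completely
    return math.degrees(math.sqrt(-2.0 * math.log(big_r)) / 2.0)

ranges = [2.0, 3.0, 10.0, 20.0]  # hypothetical ranges to background points
r0 = 5.0                         # hypothetical range to the tracked point
ft_env = (0.3, 0.4)
stationary = [tracked_env_flow(ft_env, ft_env, r, r0) for r in ranges]
moving = [tracked_env_flow(ft_env, (-0.4, 0.3), r, r0) for r in ranges]
print(axial_spread_deg(stationary), axial_spread_deg(moving))
```

For the stationary case the vectors are parallel (some reversed by 180°), so the spread is essentially zero. When the tracked point moves, the second term of (9) no longer lies along F_t^(env), and the spread grows with the variation in r(p).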

³To simplify discussion, we ignore the case of an object rotating in depth. The method developed does in fact deal effectively with this situation.


Figures  2  and  3  illustrate  the  effect.  Figure  2  shows  the  optical  flow  over  a  neighborhood  in 
which  no  motion  is  occurring  with  respect  to  the  environment.  Figure  2a  shows  the  flow  before  any 
tracking  motions  are  initiated.  The  dashed  line  indicates  the  translational  component  of  flow.  The 
rotational  component  of  flow  is  indicated  by  the  dotted  line.  The  solid  line  is  the  observed  optical 
flow,  the  sum  of  the  translational  and  rotational  components.  The  translational  components  are 
parallel.  The  variations  in  magnitude  correspond  to  underlying  variations  in  range.  The  rotational 
components are constant over the neighborhood. Note that the observed flow varies in orientation; as previously indicated, orientational variability alone is not enough to detect moving objects. Figure 2b shows the flow that results when the point in the center of the region is being tracked. The center flow is of course zero. The dashed lines now indicate the flow that would be observed without tracking. The dotted lines indicate the rotational flow that is introduced to stabilize the center point within the field of view. The solid line shows the resulting optical flow. Note that the flow vectors are parallel, but in this case differ in direction by 180°.


Figure  2:  Tracking  a  Stationary  Surface  Point. 

Figure  3  shows  the  same  flow  vectors  in  the  case  where  the  center  point  corresponds  to  a  moving 
object  and  the  two  other  points  correspond  to  portions  of  the  environment.  Note  that  in  figure  3a, 
the  translational  flow  varies  significantly  in  orientation.  If  we  actually  knew  the  translational  flow, 
this  fact  would  be  enough  to  determine  that  a  moving  object  was  present.  Without  information 
about  camera  rotation,  however,  we  must  resort  to  more  indirect  methods. 


7  Detection  Based  on  Flow/Depth  Constraint. 


Recently, efforts have been made at developing integrated approaches to analyzing stereo and motion (e.g., [6, 16]). These approaches simultaneously deal with motion and stereo disparity, either by comparing flow fields taken from different viewing positions or by establishing point correspondences over both time and viewing directions. Similar multi-cue analysis can greatly aid in the detection of moving objects. We claim, however, that it is not necessary to adopt a strategy requiring the unified low-level integration of motion and stereo. Rather, depth estimates from whatever sources are available can be used. In addition to stereo, these sources can include the full range of non-motion depth cues: familiar size, focus, gradients of various properties, aerial perspective, and many more [17]. Furthermore, while precise estimates of depth are obviously useful, relative depth or coarse approximations to depth can also aid in the analysis.

Figure 3: Tracking a Moving Object.


7.1  Objects  moving  on  surfaces. 


Knowledge  of  the  shape  of  environmental  surfaces  can  be  used  to  simplify  the  motion  detection 
problem.  Scene  structure  may  be  known  precisely  (e.g.,  the  range  to  visible  surface  points)  or  in 
terms of general properties (e.g., significant depth discontinuities can be expected). If moving
objects  must  remain  in  contact  with  environmental  surfaces  (e.g.,  vehicular  motion),  a  less  complex 
technique  depending  only  on  knowing  the  image  plane  locations  corresponding  to  discontinuities 
in  range  is  possible.  If  no  objects  are  moving  within  the  field  of  view,  equations  (1-3)  show  that 
flow  varies  inversely  with  distance  for  fixed  p.  Both  Fr  and  Ft  vary  slowly  (and  continuously)  with 
p.  Discontinuities  in  F  thus  correspond  to  discontinuities  in  r.  This  relationship  holds  only  for 
relative  motion  between  the  camera  and  a  single,  rigid  structure.  When  multiple  moving  objects 
are present, equation 1 must be modified so that there is a separate Fr^(i) and Ft^(i) specifying the relative motion between the sensor and each rigid object i. Discontinuities in flow can now arise either due to a discontinuity in range or due to the boundaries of a moving object. If independent information is available on the location of range discontinuities, then any other discontinuities in flow must be due to moving objects.
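The filtering step can be sketched along a single scanline. Flow jumps are flagged, and any jump that coincides with an independently known range discontinuity is discarded. The scanline representation, the jump threshold and the function name are hypothetical. For the planar case discussed next, the set of range discontinuities is simply empty.

```python
import math

def moving_object_boundaries(flow, range_edges, jump_threshold):
    """Flag flow discontinuities along a scanline that are not known range discontinuities.

    flow: list of (u, v) flow vectors sampled along the scanline.
    range_edges: set of indices i such that a range discontinuity is known to lie
        between samples i and i + 1 (from an independent source, e.g. stereo).
    Returns the indices i where the flow changes by more than jump_threshold
    between samples i and i + 1 with no known range discontinuity there.
    """
    out = []
    for i in range(len(flow) - 1):
        du = flow[i + 1][0] - flow[i][0]
        dv = flow[i + 1][1] - flow[i][1]
        if math.hypot(du, dv) > jump_threshold and i not in range_edges:
            out.append(i)
    return out

flow = [(1.0, 0.0), (1.02, 0.0), (2.0, 0.0), (2.01, 0.0), (1.03, 0.5), (1.04, 0.5)]
print(moving_object_boundaries(flow, range_edges={1}, jump_threshold=0.3))  # [3]
```

The jump between samples 1 and 2 is explained by the known range edge. The jump between samples 3 and 4 is not, so it is attributed to a moving object.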

The  motion  detection  problem  becomes  particularly  simple  if  the  environment  is  planar.  In 
this  case,  depth  discontinuities  are  not  possible  and  any  discontinuity  in  flow  (either  direction  or 
magnitude) corresponds to the boundary of a moving object. Note that it is not sufficient to know simply that the environment is a “smooth” surface. From some viewing positions, even smooth surfaces may exhibit range discontinuities.


7.2  Direct  comparison  of  depth  and  flow. 

A simple way of combining depth and visual motion to detect moving objects is possible if accurate 3-D position information is available for a sufficient number of surface points in the environment and on any moving objects. If both the optical flow and the depth are known for a collection of surface points in the environment, then equations (1)-(3) can be used to create a system of equations which can be solved for the parameters of motion T^(env) and ω^(env). (Knowing depth values makes this an easier task than the standard structure-from-motion problem.) If the collection of points includes some values associated with the environment and others associated with one or more objects moving with respect to the environment, the system of equations used to solve for T and ω will be inconsistent. Checking the system for consistency can therefore be used as a test for the presence of a moving object (i.e., a test for non-rigid motion in the field of view). Only the consistency of the system is important. The actual values of T and ω are not relevant to the detection problem.
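The consistency test can be sketched as a least-squares fit of a single rigid motion, with the RMS residual used as the test statistic. As a stand-in for equations (1)-(3), the sketch assumes a unit-focal-length flow model with Ft = (-Tx + x Tz, -Ty + y Tz) and the Longuet-Higgins/Prazdny rotational terms. Under that model, known depth makes the flow linear in the six motion parameters. The model, the helper names and the use of normal equations are assumptions, not the authors' method.

```python
import math

def flow_rows(x, y, z):
    """Coefficients of (Tx, Ty, Tz, wx, wy, wz) in the assumed unit-focal-length model:
    u = (-Tx + x Tz)/z + x y wx - (1 + x^2) wy + y wz
    v = (-Ty + y Tz)/z + (1 + y^2) wx - x y wy - x wz
    """
    return ([-1.0 / z, 0.0, x / z, x * y, -(1.0 + x * x), y],
            [0.0, -1.0 / z, y / z, 1.0 + y * y, -x * y, -x])

def solve(a, b):
    """Solve the square system a m = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    sol = [0.0] * n
    for k in range(n - 1, -1, -1):
        sol[k] = (m[k][n] - sum(m[k][j] * sol[j] for j in range(k + 1, n))) / m[k][k]
    return sol

def rigidity_residual(points):
    """RMS residual of the best single rigid motion fitted to (x, y, z, u, v) samples.

    Needs at least three well-spread points. A residual near zero means one rigid
    motion explains every sample; a large residual suggests that some samples
    belong to an independently moving object.
    """
    rows, rhs = [], []
    for x, y, z, u, v in points:
        ru, rv = flow_rows(x, y, z)
        rows += [ru, rv]
        rhs += [u, v]
    # Normal equations (A^T A) m = A^T b for the six motion parameters.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atb = [sum(r[i] * bi for r, bi in zip(rows, rhs)) for i in range(6)]
    params = solve(ata, atb)
    sq = sum((sum(c * p for c, p in zip(r, params)) - bi) ** 2 for r, bi in zip(rows, rhs))
    return math.sqrt(sq / len(rhs))
```

Only the size of the residual is used. In keeping with the text, the fitted parameters themselves are discarded. In practice, the cutoff that separates "consistent" from "inconsistent" residuals would have to reflect the noise level of the flow and depth estimates.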


7.3  Indirect  comparison  of  depth  and  flow. 


The availability of accurate 3-D position estimates depends in large part on having accurately calibrated camera systems. Not only is this calibration difficult, but it is continuously subject to variability due to mechanical compliance. Relative measures of visible motion and/or stereo can be used to avoid this calibration problem (e.g., [18]). For example, Rieger and Lawton have shown how to use local spatial differences to minimize difficulties due to rotation [19]. If no moving objects are visible, then large local differences in flow can only be due to a change in depth. If p^(i) and p^(j) are image points on either side of such a boundary, then from equation (1) we have:


$$\Delta F = F_r(p^{(i)}) - F_r(p^{(j)}) + \frac{F_t(p^{(i)})}{r(p^{(i)})} - \frac{F_t(p^{(j)})}{r(p^{(j)})} \qquad (10)$$


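If p^(i) and p^(j) are sufficiently close, Fr(p^(i)) ≈ Fr(p^(j)) and Ft(p^(i)) ≈ Ft(p^(j)). As a result, the rotational component of flow cancels out in the spatial difference and:

$$\Delta F \approx \left\| F_t(p)\, \Delta\!\left(\frac{1}{r}\right) \right\| \qquad (11)$$

where Δ(1/r) = 1/r(p^(i)) - 1/r(p^(j)).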

That  is,  the  difference  in  flow  across  the  edge  is  proportional  to  the  difference  of  the  reciprocal  of 
depth  across  the  edge.  The  relationship  between  stereo  disparity  and  depth  is  very  similar  to  the 
relationship  between  optical  flow  and  depth: 

$$d(p) = d_v(p) + \frac{d_b(p)}{r(p)} \qquad (12)$$

where d(p) is the stereo disparity at p, d_v is a term dependent on the camera vergence, and d_b is a term dependent on the baseline separating the cameras. Using the same argument as above, we have:


$$\Delta d \approx \left\| d_b(p)\, \Delta\!\left(\frac{1}{r}\right) \right\| \qquad (13)$$


Over a local neighborhood, Ft and d_b will remain essentially constant, while Δ(1/r) will generally vary. Dividing equation (11) by equation (13) shows that the ratio of ΔF to Δd remains constant (approximately ||Ft|| / ||d_b||), as long as the points over which the differences are taken are the same for flow and disparity.


Flow boundaries associated with moving objects are not subject to this constraint. As a result, we can detect moving objects by looking for local neighborhoods over which the ratio ΔF/Δd varies significantly. We never have to solve for the actual depth, nor do we need to know the functions Ft, Fr, d_v, or d_b. The solution does not depend on information about camera motion or relative camera geometry. For this approach to work, however, there have to be significant changes in depth over the background, not just between the background and any moving objects. There is reason to believe that such variation is important to a large class of moving object detection algorithms.
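The indirect test can be sketched by computing ΔF/Δd for pairs of neighbouring points and flagging pairs whose ratio departs from the typical value. The text only asks for neighborhoods where the ratio "varies significantly". Using the median ratio as the reference and a 25% relative tolerance are therefore our assumptions, as is the function name. Pairs with essentially no disparity change are skipped because the ratio is undefined there.

```python
import statistics

def ratio_outliers(pairs, tolerance=0.25, min_disp_diff=1e-6):
    """Flag neighbouring point pairs whose flow/disparity difference ratio is atypical.

    pairs: list of (dF, dd), where dF is the magnitude of the flow difference and dd the
        magnitude of the disparity difference across the same two image points.
    A pair is flagged when its ratio departs from the median ratio by more than
    `tolerance` (relative). Returns the flagged indices.
    """
    ratios = {}
    for k, (dF, dd) in enumerate(pairs):
        if dd > min_disp_diff:  # ratio undefined without a disparity change
            ratios[k] = dF / dd
    if not ratios:
        return []
    med = statistics.median(ratios.values())
    return [k for k, q in ratios.items() if abs(q - med) > tolerance * abs(med)]

# Background pairs generated with ||Ft|| = 0.5 and ||d_b|| = 2.0 (hypothetical values),
# plus one pair straddling a moving-object boundary.
pairs = [(0.5 * dr, 2.0 * dr) for dr in (0.1, 0.05, 0.2, 0.3)] + [(0.3, 0.2)]
print(ratio_outliers(pairs))  # [4]
```

No depth values, camera motion or camera geometry enter the computation. Each pair contributes only its two difference magnitudes.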


8  Examples. 


All of the methods described in sections 6 and 7 have been tested experimentally. Four examples
are  presented  below,  all  involving  a  moving  camera  and  potentially  moving  objects.  Two  cases 
exploit  the  epipolar  constraint.  The  first  of  these  involves  a  situation  in  which  camera  rotation 
is  known,  but  not  camera  translation.  In  the  second  case,  a  potentially  moving  object  is  being 
actively  tracked.  Results  are  also  presented  for  two  methods  utilizing  constraints  resulting  from 
the  comparison  of  depth  and  flow.  The  simplest  of  these  involves  objects  moving  over  a  smooth 
environment.  The  final  example  compares  flow  and  disparity  across  boundaries  of  possibly  moving 
objects,  using  the  technique  of  section  7.3. 

Figure 4 shows the first frame in a sequence of images of an outdoor scene. In this example,
the  camera  rotates  and  translates  with  respect  to  the  environment  while  the  toy  vehicle  moves 
to  the  right  between  image  frames.  The  rotational  velocity  of  the  camera  with  respect  to  the 
environment  was  measured.  The  optical  flow  field  shown  in  figure  5  was  obtained  by  the  token 
matching  technique  described  in  [20].  The  translational  flow  field  shown  in  figure  6  was  obtained 
by  subtracting  the  rotational  flow  component  computed  from  the  known  rotational  velocity  from 
the  observed  optical  flow  field  (figure  5).  The  gradient  of  flow  direction  in  the  translational  flow 
field  was  used  to  detect  the  boundaries  of  moving  objects.  Figure  7  shows  the  detected  boundary 
of  a  moving  object  overlaid  onto  the  first  frame  of  figure  4. 




Figure  5:  Optical  flow  field  obtained  from  the  image  sequence  of  figure  4. 




Figure  6:  Translational  flow  field  determined  from  the  optical  flow  field  of  figure  5. 




In figure 8, the mechanical toy in the corner of the image is being tracked by the camera while the camera is translating to the left with respect to the environment. Figure 9 shows the estimated optical flow. Figure 10 shows a histogram of the directions of the optical flow. Note that there are two distinct peaks in the histogram. The variation in flow direction over the image was computed to be approximately 34 degrees, indicating that the object was in fact moving.


Figure 8: First frame of tracking sequence.


Figure  9:  Optical  flow  field  obtained  from  the  image  sequence  of  figure  8. 




Figure  10:  Histogram  of  the  flow  direction  of  the  optical  flow  vectors  in  figure  9. 


As a comparison, a similar experiment was also performed in which the tracked object, a rock, is stationary with respect to the environment while the camera is moving. A pair of images similar to that of figure 8 was obtained. The resulting estimated optical flow field is shown in figure 11. Its
corresponding  histogram  is  shown  in  figure  12.  Note  that  only  one  distinct  peak  is  observed  in  this 
histogram.  The  global  variation  in  flow  direction  in  this  case  was  computed  to  be  approximately 
11°  which  is  significantly  smaller  than  that  of  the  previous  example. 

An  image  sequence  starting  with  the  frame  shown  in  figure  13  is  used  to  illustrate  the  technique 
for detecting objects moving in a smooth environment. In this example, the camera moves with
respect  to  an  environment  consisting  of  various  small  pieces  of  hardware  lying  on  a  planar  surface. 
The  optical  flow  field  shown  in  figure  14  was  obtained  in  the  same  manner  as  in  figure  5.  Figure  15 
shows  the  locations  of  large  variations  in  optical  flow  values,  corresponding  to  the  boundary  of  a 
moving  object. 

A  stereo  image  sequence  starting  with  the  stereo  pair  shown  in  figure  16  is  used  to  illustrate 
the  technique  of  indirect  comparison  of  flow  and  disparity  as  a  basis  for  moving  object  detection. 
Both the flow field shown in figure 17 and the disparity field shown in figure 18 were obtained using
the token matching method used for figure 5. Comparing the ratio of the change in disparity values to the change in flow
values  across  neighboring  points,  and  selecting  as  the  boundaries  of  moving  objects  those  areas  in 
which  there  is  a  distinct  discontinuity  in  that  ratio,  results  in  the  identification  of  the  boundaries 
indicated  in  figure  19. 




Figure  14:  Optical  flow  field  obtained  from  the  image  sequence  of  figure  13. 




Figure  15:  Boundary  of  a  moving  object  overlaid  onto  the  first  image  of  figure  13. 


Figure  17:  Optical  flow  field  obtained  for  right  image  sequence  of  figure  16. 


Figure 18: Disparity field obtained across the stereo pair in figure 16.




Figure  19:  Boundary  of  a  moving  object  overlaid  onto  the  right  image  of  the  stereo  pair  in  figure  16. 

9  Discussion. 

9.1  Which  method  to  use? 

This  paper  presents  a  collection  of  loosely  related  techniques  for  visually  detecting  moving  objects. 
Detection  based  purely  on  visual  motion  from  a  single  camera  seems  quite  difficult.  Each  of  the 
methods  presented  here  uses  some  sort  of  additional  information,  either  about  current  camera 
motion  or  scene  structure.  The  methods  are  characterized  by  the  additional  information  used,  the 
underlying  constraints  exploited,  and  the  particular  computational  structure  used  to  implement 
the technique. It is likely that reliable moving object detection will require several complementary
techniques,  along  with  a  method  for  selecting  which  detector  to  trust  in  any  particular  situation. 

9.2  Computational  structure. 

The methods described above can be grouped into three classes. Point-based techniques (completely known motion) compare individual optical flow vectors against some standard to determine incompatibilities with the motion of the camera relative to the environment. In all cases described here, the compatibility measure is based on a directional constraint associated with the focus of expansion of the translational flow field. Point-based methods have the advantages of computational simplicity and the ability to detect very small moving objects. They will be most effective when parameters of motion are known precisely and the magnitude of the translational flow field at the point in question is sufficiently large to allow an accurate estimate of direction.

Edge-based techniques (known rotation, smooth surface) roughly correspond to traditional edge detection. Edge-based motion detection is characterized by the differential flow properties examined and by the filtering technique used to separate edges due to range discontinuities from those due to moving objects. The approach is effective when surfaces are smooth and techniques exist for accurately locating those range discontinuities that do exist. Edge-based methods have the advantage of specifying the outline of moving objects that are detected. They are likely to be of limited use when moving objects are quite small.

Region-based techniques (tracked object, depth/flow comparisons) examine optical flow values over a region, searching for distributions incompatible with rigid motion. As with edge-based approaches, the viewed region must include portions of both object and environment. When it does, region-based testing is an effective test for moving objects that does not require any information about camera motion. The region-based method based on tracking potentially moving objects likewise requires no information about camera motion, but does require that there be significant variations in range over the visible portions of the environment.


9.3  Limitations. 


All detection algorithms founded on the motion epipolar constraint share two important shortcomings. First, environmental flow vectors will be small near the FOE regardless of the ranges involved. As a result, detection based on flow orientation will be unreliable within a region around the FOE.⁴ This means that epipolar-based methods will have difficulties for viewing directions close to the direction of motion. This is of course the direction in which moving object detection is likely to be most important. One heuristic for partially overcoming limitations near the FOE is to look for large magnitude values of translational flow near the FOE. Such values correspond either to moving objects or to environmental points that are very close to the camera. Second, while the motion epipolar methods were developed to allow for the possibility of a moving camera, translational camera motion is actually a requirement. Without translational motion, there is no motion epipolar constraint to violate. More specifically, not only must the camera be moving, but significant portions of the visible environment must be sufficiently close to generate detectable non-zero translational flow values. Most methods based on the depth/flow or rigidity constraints should work for both moving and stationary cameras.
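The near-FOE heuristic described above reduces to a magnitude test inside a neighbourhood of the FOE. The radius and magnitude threshold are hypothetical parameters, as is the function name. As the text notes, a flagged point may be either a moving object or a very close environmental point, and this sketch makes no attempt to distinguish the two.

```python
import math

def near_foe_candidates(samples, foe, radius, mag_threshold):
    """Within `radius` of the FOE, flag translational flow with unusually large magnitude.

    samples: list of (x, y, u, v), where (u, v) is the translational (derotated) flow.
    Returns the indices of flagged samples; points outside the radius are left to
    the ordinary direction-based tests.
    """
    out = []
    for k, (x, y, u, v) in enumerate(samples):
        if math.hypot(x - foe[0], y - foe[1]) <= radius and math.hypot(u, v) > mag_threshold:
            out.append(k)
    return out

samples = [(0.01, 0.0, 0.001, 0.0), (0.02, 0.01, 0.05, 0.03), (0.5, 0.5, 0.2, 0.2)]
print(near_foe_candidates(samples, foe=(0.0, 0.0), radius=0.1, mag_threshold=0.01))  # [1]
```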

No method for detecting moving objects will be effective if it depends on knowing precise values of optical flow. Techniques for estimating optical flow are intrinsically noisy (e.g., see [22]). Additional difficulties arise due to the idealized nature of equations (1-3). Real cameras are not point projection systems. Substantial effort is required to accurately determine the values of x and y in (1-3). Geometric distortions in the optical and sensing systems affect measured locations on the image plane. Variations in effective focal length can be substantial. Reliable techniques will be based on searching for large magnitude effects in the flow field [23]. All of the methods described above compare flow vectors to some predetermined standard, or look for significant differences across flow boundaries. As a result, all deal with relatively large magnitude effects. Reliability is still dependent on scene structure, the nature of camera motion, and position in the visual field relative to the direction of translation.

⁴Lawton describes a “dead zone” around the FOE within which no information based exclusively on camera motion is available [21]. This effect is a problem not only for moving object detection, but also for techniques such as motion stereo.


Acknowledgement. 

Martin  Kenner  provided  significant  assistance  in  preparing  the  examples. 


References 

[1]  G.  Adiv.  "Inherent  ambiguities  in  recovering  3-d  motion  and  structure  from  a  noisy  flow  field”, 
Proc.  IEEE  Conference  on  Computer  Vision  and  Pattern  Recognition ,  70-77,  1985. 

[2]  Proc.  Workshop  on  Motion:  Representation  and  Analysis.  May  1986. 

[3]  S.  Ullman.  The  Interpretation  of  Visual  Motion ,  MIT  Press,  Cambridge,  MA,  1979. 

[4]  R.  Jain,  W.N.  Martin,  and  J.K.  Aggarwal.  “Extraction  of  moving  object  images  through 
change  detection”,  Proc.  Sixth  International  Joint  Conference  on  Artificial  Intelligence ,  425- 
428,  1979. 

[5]  R.  Jain,  D.  Militzer,  and  H.-H.  Nagel,  “Separating  non-stationary  from  stationary  scene 
components  in  a  sequence  of  real  world  TV  images”,  Proc.  Fifth  International  Joint  Conference 
on  Artificial  Intelligence ,  425-42S,  1977. 

[6]  A.M.  Waxman  and  J.H.  Duncan,  “Binocular  image  flows”,  Proc.  Workshop  on  Motion: 
Representation  and  Analysis ,  1986. 

[7]  W.B.  Thompson,  K.M.  Mutch,  and  V.A.  Berzins,  “Dynamic  occlusion  analysis  in  optical  flow 
fields”,  IEEE  Trans,  on  Pattern  Analysis  and  Machine  Intelligence ,  PAMI-7:374-383,  July 
1985. 

[8]  R.C.  Jain,  “Segmentation  of  frame  sequences  obtained  by  a  moving  observer”,  IEEE  Trans, 
on  Pattern  Analysis  and  Machine  Intelligence,  PAMI-6:624-629,  September  1984. 

[9]  W.B.  Thompson,  “Combining  motion  and  contrast  for  segmentation”,  IEEE  Trans,  on  Pat¬ 
tern  Analysis  and  Machine  Intelligence,  PAMI-2:543-549,  November  1980. 

[10]  W.F.  Clocksin,  “Perception  of  surface  slant  and  edge  label:  from  optical  flow:  a  computational 
approach”,  Perception,  9:253-269,  1980. 

[11]  K.  Nakayama  and  J.M.  Loomis,  “Optical  velocity  patterns,  velocity  sensitive  neurons,  and 
space  perception:  a  hypothesis”,  Perception,  3:63-80,  1974. 

[12]  A.R.  Bruss  and  B.K.P.  Horn,  “Passive  navigation”,  Computer  Vision,  Graphics  and  Image 
Processing,  21(l):3-20,  January  1983. 


23 


[13]  D.H.  BaLlard  and  O.A.  Kimball.  “Rigid  body  motion  from  depth  and  optical  flow’’.  Computer 
Vision,  Graphics  and  [mage  Processing ,  22:95-115,  1983. 

[14]  D.A.  -Marr.  Vision.  W.H.  Freeman.  San  Francisco.  1982. 

[15]  A.  Bandopadhav.  B.  Chandra,  and  D.H.  Ballard.  “Active  navigation:  tracking  an  environ¬ 
mental  point  considered  beneficial”,  Proc.  Workshop  on  Motion:  Representation  and  Analysis. 
23-29.  1986. 

[16] T.S. Huang, S.D. Blostein, A. Werkheiser, M. McDonnel, and M. Lew, “Motion detection and estimation from stereo image sequences: some preliminary experimental results”, Proc. Workshop on Motion: Representation and Analysis, 45-46, 1986.

[17]  J.J.  Gibson,  The  Perception  of  the  Visual  World,  Riverside  Press,  Cambridge  MA,  1950. 

[18] R.A. Brooks, A.M. Flynn, and T. Marill, “Self calibration of motion and stereo for mobile robots”, Proc. 4th Int. Symposium on Robotics Research, 1987.

[19] J.H. Rieger and D.T. Lawton, “Sensor motion and relative depth from difference fields of optic flows”, Proc. Eighth International Joint Conference on Artificial Intelligence, 1027-1031, 1983.

[20] S.T. Barnard and W.B. Thompson, “Disparity analysis of images”, IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-2:333-340, July 1980.

[21]  D.T.  Lawton,  personal  communication. 

[22] J.K. Kearney, W.B. Thompson, and D.L. Boley, “Optical flow estimation: an error analysis of gradient-based methods with local optimization”, IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-9:229-244, March 1987.

[23] W.B. Thompson and J.K. Kearney, “Inexact vision”, Proc. Workshop on Motion: Representation and Analysis, 15-21, 1986.




DETECTING  MOVING  OBJECTS 


William B. Thompson, Ting-Chuen Pong


Computer  Science  Department 
University  of  Minnesota 
Minneapolis, MN 55455


ABSTRACT 

The detection of moving objects is important in many tasks. This paper examines moving object detection based primarily on visual motion. We conclude that in realistic situations, detection using visual information alone is quite difficult, particularly when the camera is also moving. The availability of additional information about camera motion and/or scene structure greatly simplifies the problem. We develop detection algorithms for the cases in which 1) camera motion is known, 2) only camera rotation is known, 3) only camera translation is known, 4) objects move in contact with a smooth surface, and 5) an object is being actively tracked, but the camera motion associated with the tracking is not known precisely. Examples of several of these techniques are presented.


1. Introduction.

One important function of a vision system is to recognize the presence of moving objects in a scene. If the camera is stationary and illumination constant, this can be done by simple techniques which compare successive image frames, looking for significant differences. If the camera is moving, the problem is considerably more complex. For the purposes of this discussion, moving objects are taken to be any objects moving with respect to the stationary portions of the scene, which we refer to as the environment. For a moving camera, both moving objects and stationary portions of the scene may be changing position with respect to the camera and thus generating visual motion in the imagery. A moving camera leads to difficulties because of the need to determine objects moving with respect to the environment, rather than the much easier problem of finding objects moving with respect to the camera. In this paper, we deal with the problem of detecting moving objects from a moving camera based on optical flow.

The visual detection of moving objects is a surprisingly difficult task. A simple example illustrates just how serious the problem can be. Consider the optical flow field shown in figure 1, which appears to show a small, square region in the center of the image moving to the right and surrounded by an apparently stationary background. Such a flow field can arise from several equally plausible situations: 1) The camera is stationary with respect to the environment and the central region corresponds to an object moving to the right. 2) The camera is moving to the left with respect to the environment; most of the environment is sufficiently distant that the generated optical flow is effectively zero, while the central region corresponds to a surface near to the camera but stationary with respect to the environment. 3) The camera and object are moving with respect to both the environment and each other, though the environment is sufficiently distant that there is no perceived optical flow. From the flow field alone, it is not possible to tell whether or not this seemingly simple pattern corresponds to a moving object!

Figure 1 provides one example of why a general and reliable solution to the problem of moving object detection based only on visual motion is not feasible. Robust solutions require that additional information about camera motion and/or scene structure be available. In this paper, we examine a variety of types of information that might be available. Each information source places constraints on the optical flow fields that can be generated by a camera moving through an otherwise static environment. Violations of these constraints are thus necessarily due to moving objects.

This work was supported by the Air Force Office of Scientific Research under grant AF0M43-03S1.


Figure 1: Is The Central Region a Moving Object?


Figure 2 summarizes potential sources of information and the associated constraints on optical flow. The next section lists general properties needed by reliable detection algorithms. Following this is a derivation of each of the flow constraints. We conclude with an experimental demonstration of several of the techniques and general observations about the nature of these methods.

2.  Assumptions. 

We start with the presumption that motion detection algorithms should be designed with the following properties in mind:


The field of view may be relatively narrow.

Motion detection should not depend on the use of wide angle imaging systems. Such systems may not be available in a given situation, and if used may increase the difficulty of recognizing small moving objects. As a result, algorithms should not depend on subtle properties of perspective.




CH2465-3/87/0000/0201 $01.00 © 1987 IEEE


Knowing                      | Yields a constraint on
---------------------------- | ---------------------------------------------------
full parameters of motion    | flow values
parameters of rotation       | variability of flow direction
surfaces are smooth          | local variability of direction or magnitude of flow
object is being tracked      | global variability of direction of flow

Figure 2: Constraints on Flow.


The image of moving objects may be small with respect to the field of view.

This is clearly problematic for reliability. Moving objects may be far away and subtend relatively small visual angles. We need methods capable of identifying single image points, or at least small collections of points, as corresponding to moving objects. Detection algorithms thus cannot depend on variations in flow over a potentially moving object.

Only  monocular  imagery  is  available. 

This is equivalent to the situation where objects of interest can be far away relative to the camera base-line in a stereo viewing situation.

Estimated optical flow fields will be noisy.

No method is capable of estimating optical flow with arbitrary accuracy. Motion detection based on optical flow must be tolerant of noisy input.

Only "instantaneous" optical flow is used.

A restriction to instantaneous flow eliminates the use of temporal derivatives of flow and/or multiple views at distinct time intervals. Temporal derivatives will increase noise in the estimated flow values. Use of multiple views increases computational complexity. (There are reasons to believe that multi-frame analysis may in fact improve reliability [1], though such methods are not examined in this work.)

3.  Constraints  on  Optical  Flow. 

The basic mathematics governing the optical flow generated by a moving camera is well known. We take our notation from [2], using a coordinate system fixed to the camera (i.e. the world can be thought of as moving past a stationary camera). Optical flow values are a function of image location, the relative motion between the camera and the surface point corresponding to the image location, and the distance from the camera to the corresponding surface point.


Let $p = (x, y)$ refer to an image location, where $x$ and $y$ have been normalized by the focal length of the camera. Let $P = (X, Y, Z)$ be the coordinates of the surface point projecting onto $(x, y)$, specified in a coordinate system with origin at the camera and $Z$ axis along the optical axis of the camera. Specify the motion of the point at $(X, Y, Z)$ with respect to the camera in terms of a translational velocity $T = (U, V, W)^T$ and a rotational velocity $\omega = (A, B, C)^T$. The optical flow, $F = (u, v)$, at $p$ is purely a function of $x$, $y$, $T$, $\omega$, and $Z$:


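$$u = u_t + u_r, \qquad v = v_t + v_r \tag{1}$$

where $u$ is the $x$ component of flow, $v$ is the $y$ component of flow, and

$$u_t = \frac{-U + xW}{Z}, \qquad v_t = \frac{-V + yW}{Z} \tag{2}$$

$$u_r = Axy - B(x^2 + 1) + Cy, \qquad v_r = A(y^2 + 1) - Bxy - Cx \tag{3}$$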


Let the parameters specifying camera motion with respect to the environment be $T_e$ and $\omega_e$, and the corresponding parameters specifying relative motion between the camera and a scene point $P$ be $T_P$ and $\omega_P$.
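As a concrete reference for equations (1)-(3), the following minimal NumPy sketch evaluates the instantaneous flow at one or more image points. It is a direct transcription of the equations as written above; the function name `optical_flow` and its argument layout are illustrative choices, not part of the original work.

```python
import numpy as np


def optical_flow(x, y, Z, T, omega):
    """Evaluate equations (1)-(3) for the instantaneous optical flow (u, v).

    x, y  -- image coordinates normalized by focal length (scalars or arrays)
    Z     -- depth of the corresponding surface point (must be > 0)
    T     -- translational velocity (U, V, W)
    omega -- rotational velocity (A, B, C)
    """
    x, y, Z = (np.asarray(a, dtype=float) for a in (x, y, Z))
    U, V, W = T
    A, B, C = omega
    # Equation (2): translational component, scaled by inverse depth.
    u_t = (-U + x * W) / Z
    v_t = (-V + y * W) / Z
    # Equation (3): rotational component, independent of depth.
    u_r = A * x * y - B * (x**2 + 1) + C * y
    v_r = A * (y**2 + 1) - B * x * y - C * x
    # Equation (1): the two components add.
    return u_t + u_r, v_t + v_r


# Pure translation: the flow vanishes at (x, y) = (U/W, V/W).
u, v = optical_flow(2.0, 0.0, 7.0, T=(1.0, 0.0, 0.5), omega=(0.0, 0.0, 0.0))
```

The usage line checks one property that follows from equation (2): with no rotation, the flow is zero at $(U/W, V/W)$, the point that becomes the focus of expansion.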

3.1.  Known  translation  and  rotation. 

The parameters of camera motion constrain possible optical flow values that can occur due to camera motion with respect to the environment.


If complete information about instantaneous camera motion is available, then $T_e$ and $\omega_e$ are known. If the camera is translating but not rotating with respect to the background, $\omega_e = 0$ and all flow vectors due to the moving image of the background will radiate away from a focus of expansion (FOE). From equations (1) and (2), it is easy to see that the image plane location of the FOE is at:

$$x_f = \frac{U}{W}, \qquad y_f = \frac{V}{W} \tag{4}$$


The location of the FOE depends only on the direction of translation, not on the speed, so methods for motion detection which depend on the location of the FOE do not actually require the complete parameters of translational motion. The FOE may not lie within the visible portion of the image (and in fact may be a focus of contraction). A FOE at $\infty$ corresponds to pure lateral motion, which generates a parallel optical flow pattern. At every image point $p$, knowing the FOE fully specifies the direction of optical flow consistent with any surface point stationary with respect to the environment. At $p$:


$$\theta_{foe} = \tan^{-1}\frac{V - Wy}{U - Wx}, \qquad \theta_f = \tan^{-1}\frac{v}{u} \tag{5}$$


where $\theta_{foe}$ is the direction from $p$ towards the FOE and $\theta_f$ is the direction of optical flow at $p$. (Note that the first equation is still well defined even if $W = 0$, corresponding to a focus of expansion at $\infty$ in image coordinates.) Any flow values with a different direction correspond to moving objects [3]. That is, moving objects exist wherever $|\theta_{foe} - \theta_f| > \epsilon$, for some appropriate $\epsilon$. (It is possible, however, that moving objects coincidentally generate flow values compatible with this constraint.) This approach requires the estimation of only the direction of flow, not the magnitude or the spatial variation of flow.
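The point-based test of equation (5) can be sketched as follows, assuming rotation is zero or has already been removed from the flow. The function names and the threshold `eps` are hypothetical. As in (5), directions are compared modulo 180°, so a flow vector pointing straight at the FOE is not flagged; a practical version would presumably also discard flow vectors too small for their direction to be estimated reliably.

```python
import numpy as np


def line_angle_diff(a, b):
    """Difference between two directions taken modulo 180 degrees, in [0, pi/2]."""
    d = np.mod(a - b, np.pi)
    return np.minimum(d, np.pi - d)


def flag_moving_known_translation(x, y, u, v, T, eps):
    """Flag flow vectors whose direction violates the FOE constraint (5).

    x, y, u, v -- arrays of image positions and (rotation-free) flow
    T          -- translation (U, V, W); only its direction matters
    eps        -- angular tolerance in radians (an assumed tuning parameter)
    """
    U, V, W = T
    # Direction from p towards the FOE; well defined even when W == 0.
    theta_foe = np.arctan2(V - W * y, U - W * x)
    # Direction of the observed flow at p.
    theta_f = np.arctan2(v, u)
    return line_angle_diff(theta_foe, theta_f) > eps
```

Only the direction of $T$ enters the test, so scaling $(U, V, W)$ by any positive constant leaves the result unchanged, consistent with the remark above that the FOE does not depend on speed.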


Camera rotation introduces considerable complexity. Knowledge of camera motion no longer constrains the direction of background flow. Nevertheless, at a given point $p$, flow is constrained to a one-dimensional family of possible vector values. The family is given by (1)-(3), where $Z$ ranges over all positive values.
The analysis can be simplified because of the linear nature of (1). The terms $u_r$ and $v_r$ depend only on the parameters of rotation and not on any shape property of the environment. Because the values of $u_r$ and $v_r$ at a particular point $p$ do not depend on $Z$, they can be predicted knowing only $\omega_e$. These values can be subtracted from the observed optical flow field, leaving a translational flow field:

$$F_t = (u_t, v_t) = F - F_r, \qquad F_r = (u_r, v_r) \tag{6}$$

where $u_r$ and $v_r$ are defined in equation (3). This field behaves just as if no rotation were occurring, and thus moving objects can be located using the FOE technique described above. For the remainder of this paper, when rotation is present, we will take the term FOE to refer to the focus of expansion of this translational field.
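The derotation step of equation (6) is a pointwise subtraction, since the rotational terms of equation (3) do not involve $Z$. A minimal sketch, assuming the rotational velocity $(A, B, C)$ is known exactly (function names are illustrative):

```python
import numpy as np


def rotational_flow(x, y, omega):
    """Rotational flow (u_r, v_r) from equation (3); no depth is needed."""
    A, B, C = omega
    u_r = A * x * y - B * (x**2 + 1) + C * y
    v_r = A * (y**2 + 1) - B * x * y - C * x
    return u_r, v_r


def translational_field(x, y, u, v, omega):
    """Equation (6): remove the predicted rotational flow from the observed flow."""
    u_r, v_r = rotational_flow(x, y, omega)
    return u - u_r, v - v_r
```

The output can be checked against equation (5): for a static scene, each derotated vector is parallel to $(U - Wx, V - Wy)$.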


In principle, even if camera motion is not known, $T_e$ and $\omega_e$ may be estimated from the imagery [2], subject to a positive, multiplicative scale factor for $T_e$. Two serious problems exist, however. Narrow angles of view make estimation of camera motion difficult, as significantly different parameters of motion and surface shape can yield nearly identical optical flow patterns [4]. In addition, techniques such as [2] use a global minimization approach which will not perform well if moving objects make up a substantial portion of the field of view. A clustering approach (e.g. [3]) can be made tolerant of the moving objects, though great difficulty can be expected dealing with a five dimensional cluster space.

3.2. Known rotation.

The parameters of camera rotation constrain the local variability of optical flow direction that can occur due to camera motion with respect to the environment.

Often, information about camera rotation is available, even when the direction of translation is not known. Non-visual information about camera motion often comes from inertial sources. Such sources are much more accurate in determining rotation than translation. Rotation involves a continuous acceleration which is easily measured. The determination of translation requires the integration of accelerations, along with a starting boundary value. Errors in estimated translation values rapidly accumulate. A simple technique allows the detection of moving objects when only camera rotation is known.

In the previous section, knowledge of camera rotation made it possible to compute the translational flow field, $F_t$. Knowledge of translation was then used to locate the FOE and thus constrain the direction of flow vectors associated with the environment. If only rotation is known, then it is still possible to determine the translational flow field, but not the FOE. Visual methods could be applied to the translational flow field to estimate the location of the FOE, but these methods suffer from a number of practical limitations when applied to noisy data. An alternate approach can be used which does not require the prior determination of the FOE. The translational flow field extends radially from the focus of expansion. At any point significantly away from the FOE, the direction of flow (but not necessarily the magnitude of flow) will vary slowly. Directional variability can be evaluated based on equation (5):

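$$\frac{\partial \theta_t}{\partial x} = \frac{W(V - yW)}{(V - yW)^2 + (U - xW)^2}, \qquad \frac{\partial \theta_t}{\partial y} = \frac{-W(U - xW)}{(V - yW)^2 + (U - xW)^2} \tag{7}$$

where $\theta_t$ is the direction of the translational flow field. Since $(U - xW)^2 + (V - yW)^2 = W^2\left[(x - x_f)^2 + (y - y_f)^2\right]$, the magnitude of the gradient of the direction of the translational flow field can thus be obtained (for $W \neq 0$) as

$$\lVert \nabla \theta_t \rVert = \sqrt{\left(\frac{\partial \theta_t}{\partial x}\right)^2 + \left(\frac{\partial \theta_t}{\partial y}\right)^2} = \frac{1}{\sqrt{(x - x_f)^2 + (y - y_f)^2}} \tag{8}$$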


where $(x_f, y_f)$ is the image plane location of the FOE. We can see from the above equation that over any local area away from the FOE, variations in the direction of the translational flow field will be small. Flow arising due to moving objects is of course not subject to this restriction. The gradient of flow field direction can thus be used to detect the boundaries of moving objects. At these boundaries, flow direction will vary discontinuously¹.
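A grid-based sketch of this known-rotation test follows. It assumes the translational field has already been obtained via equation (6) and is sampled on a regular grid, and it approximates $\lVert\nabla\theta_t\rVert$ with forward differences, wrapping angle differences so that the ±180° seam of `arctan2` is not mistaken for an edge. The names, the synthetic scene, and the fixed threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def wrap_angle(a):
    """Wrap angles (radians) into [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi


def direction_gradient(u_t, v_t, spacing):
    """Approximate |grad theta_t| of a translational flow field on a grid.

    u_t, v_t -- translational (derotated) flow on a regular 2-D grid
    spacing  -- grid spacing in normalized image coordinates
    Forward differences are used; the last column/row is left at zero.
    """
    theta = np.arctan2(v_t, u_t)
    gx = np.zeros_like(theta)
    gy = np.zeros_like(theta)
    # Wrap differences so a jump across the +/-pi seam is not read as an edge.
    gx[:, :-1] = wrap_angle(np.diff(theta, axis=1)) / spacing
    gy[:-1, :] = wrap_angle(np.diff(theta, axis=0)) / spacing
    return np.hypot(gx, gy)


# Synthetic example: static background (FOE far outside the image, at (10, 2))
# plus a square region whose flow points in a different direction.
xs = np.linspace(-0.5, 0.5, 51)                           # grid spacing 0.02
X, Y = np.meshgrid(xs, xs)
Z = np.random.default_rng(3).uniform(2.0, 10.0, X.shape)  # depth varies freely
U, V, W = 1.0, 0.2, 0.1
u_t, v_t = (-U + X * W) / Z, (-V + Y * W) / Z             # equation (2)
u_t[20:30, 20:30], v_t[20:30, 20:30] = 0.0, 0.3           # "moving object"
edges = direction_gradient(u_t, v_t, spacing=0.02) > 5.0  # illustrative threshold
```

The random depths change the flow magnitude everywhere but not its direction, so they do not trigger the detector. Because the FOE is unknown in this setting, the threshold has to be large relative to the background values, which by equation (8) are about one over the distance to the FOE; points close to the FOE would therefore produce false alarms.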


A complementary technique is available to deal with situations in which translation but not rotation is known. We can expect these situations to be rare, however. If the direction of translation were known over some interval of time, it would be an easy matter to determine the rotation by examining the rate of change of direction.


3.3. Motion over smooth surfaces.

Object  motion  over  smooth  surfaces  constrains  the  local  variability 
of  flow. 


Knowledge of the shape of environmental surfaces can be used to simplify the motion detection problem. Scene structure may be known precisely (e.g. the range to visible surface points) or in terms of general properties (e.g. whether significant depth discontinuities can be expected). Information about scene structure can come from visual sources (e.g. stereo [9, 10]), or from pre-existing models of the environment. If both the optical flow, $(u, v)$, and the depth, $Z$, are known for a collection of surface points in the environment, then (1)-(3) can be used to create a system of equations which can be solved for the parameters of motion $T$ and $\omega$. If the collection of points includes some values associated with the environment and others associated with one or more objects moving with respect to the environment, the system of equations used to solve for $T$ and $\omega$ will be inconsistent. Checking the system for consistency can therefore be used as a test for the presence of a moving object (i.e. a test for non-rigid motion in the field of view).
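The consistency check can be sketched as an ordinary linear least-squares problem: with $Z$ known, each point contributes two equations that are linear in $(U, V, W, A, B, C)$. This minimal sketch (the function name is hypothetical) uses `numpy.linalg.lstsq` and reports the RMS residual; the residual threshold that separates noise from a genuine moving object is an assumption that would depend on the flow and range noise.

```python
import numpy as np


def rigid_motion_residual(x, y, Z, u, v):
    """Least-squares fit of (U, V, W, A, B, C) to equations (1)-(3).

    With depth Z known, each point gives two equations that are linear in the
    six motion parameters. Returns the estimate and the RMS residual; a large
    residual means no single rigid motion explains all of the points.
    """
    x, y, Z, u, v = (np.asarray(a, dtype=float) for a in (x, y, Z, u, v))
    zero = np.zeros_like(x)
    # Coefficients of (U, V, W, A, B, C) in the u equation ...
    rows_u = np.column_stack([-1 / Z, zero, x / Z, x * y, -(x**2 + 1), y])
    # ... and in the v equation.
    rows_v = np.column_stack([zero, -1 / Z, y / Z, y**2 + 1, -x * y, -x])
    M = np.vstack([rows_u, rows_v])
    b = np.concatenate([u, v])
    params, *_ = np.linalg.lstsq(M, b, rcond=None)
    rms = np.sqrt(np.mean((M @ params - b) ** 2))
    return params, rms
```

Because the depths are known, the full translation, not just its direction, is recoverable in this setting.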

If moving objects must remain in contact with environmental surfaces (e.g. vehicular motion), a less complex technique is possible, depending only on knowing the image plane locations corresponding to discontinuities in range. If no objects are moving within the field of view, equations (1)-(3) can be simplified into the following form:


$$\mathrm{flow}(p) = f_r(p) + \frac{f_t(p)}{r} \tag{9}$$

where at an image point $p$, $\mathrm{flow}(p)$ is the optical flow (a two-dimensional vector), $f_r$ is the component of the flow due to the rotation of the scene with respect to the sensor, $f_t$ is dependent on the translational motion of the sensor and the viewing angle relative to the direction of translation, and $r$ is the distance between the sensor and the surface visible at $p$ (i.e. the value of $Z$ in equation (2) corresponding to the image location $p$). For fixed $p$, the translational part of the flow varies inversely with distance. Both $f_r$ and $f_t$ vary slowly (and continuously) with $p$. Discontinuities in flow thus correspond to discontinuities in $r$. This relationship holds only for relative motion between the camera and a single, rigid structure. When multiple moving objects are present, equation (9) must be modified so that


¹ Marr [6] claims that …




there is a separate $f_r(\cdot)$ and $f_t(\cdot)$ specifying the relative motion between the sensor and each rigid object. Discontinuities in flow can now arise either due to a discontinuity in range or due to the boundaries of a moving object. If independent information is available on the location of range discontinuities, then any other discontinuities in flow must be due to moving objects.

The motion detection problem becomes particularly simple if the environment is planar. In this case, depth discontinuities are not possible and any discontinuity in flow (either direction or magnitude) corresponds to the boundary of a moving object. Note that it is not sufficient to know simply that the environment is a "smooth" surface. From some viewing positions, even smooth surfaces may exhibit range discontinuities.
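For the planar case, or when range discontinuities are known independently, a sketch of the detector simply flags large jumps in the flow vector between neighbouring grid points, optionally masking out jumps that coincide with known range discontinuities. The function name, the `threshold`, the `range_edges` mask, and the synthetic scene are hypothetical; in practice the threshold would have to exceed the flow differences expected from noise and smooth depth variation.

```python
import numpy as np


def flow_discontinuities(u, v, threshold, range_edges=None):
    """Flag grid points next to a large jump in the flow vector.

    u, v        -- flow sampled on a regular 2-D grid
    threshold   -- smallest vector difference treated as a discontinuity
    range_edges -- optional boolean mask of known range discontinuities;
                   flow jumps there are attributed to depth, not to motion
    """
    jump = np.zeros(u.shape)
    # Size of the vector difference to the right-hand and lower neighbours.
    dx = np.hypot(np.diff(u, axis=1), np.diff(v, axis=1))
    dy = np.hypot(np.diff(u, axis=0), np.diff(v, axis=0))
    # Credit each difference to the pixels on both sides of it.
    jump[:, :-1] = np.maximum(jump[:, :-1], dx)
    jump[:, 1:] = np.maximum(jump[:, 1:], dx)
    jump[:-1, :] = np.maximum(jump[:-1, :], dy)
    jump[1:, :] = np.maximum(jump[1:, :], dy)
    flags = jump > threshold
    if range_edges is not None:
        flags &= ~range_edges
    return flags


# Synthetic example: forward camera translation over a tilted plane, for which
# 1/Z is linear in (x, y), plus a small patch whose flow is offset.
xs = np.linspace(-0.5, 0.5, 51)
X, Y = np.meshgrid(xs, xs)
Z = 5.0 / (1.0 + 0.5 * Y)
u, v = X / Z, Y / Z              # equation (2) with U = V = 0, W = 1
u[20:30, 20:30] += 0.2           # hypothetical moving object
boundary = flow_discontinuities(u, v, threshold=0.05)
```

On the plane, flow varies smoothly, so every flagged point lies on the boundary of the offset patch; supplying a `range_edges` mask covering that boundary would suppress the flags, as it would for a genuine depth edge.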

3.4. Tracking regions of interest.

Tracking  an  object  constrains  the  global  variability  of  the  direction 
of  flow  in  the  surrounding  area. 

A vision system which can actively control camera direction is capable of tracking regions of interest over time, keeping some particular object centered within the field of view. Tracking regions of interest is desirable for many reasons other than the detection of moving objects (e.g. [11]), though the analysis of imagery arising from a tracking camera has not received much study by the computer vision community. If there are significant variations in depth over the visible portion of the background and if moving objects are relatively small with respect to the field of view, then moving object detection based on tracking can be accomplished without any actual knowledge of camera motion. (For motion detection, the tracking can easily be simulated if the camera is not actively controllable.)


If an object is being tracked, then its optical flow is zero. Flow based methods for determining whether or not a tracked object is moving must depend wholly on the patterns of flow in the background. Object tracking helps in moving object detection because it minimizes many of the difficulties due to rotation. When dealing with instantaneous flow fields, we can decompose the problem by considering all translational motion to be due to movement of the camera platform and all rotational motion due to pan and tilt of the camera to accomplish the tracking. (We will disregard any effects due to spin around the line of sight.) Consider the effect of tracking a point that is in fact part of the environment. The translational component of motion induces an optical flow field that extends radially from the focus of expansion, with magnitudes dependent on the range to the corresponding surface points. Over a local area away from the focus of expansion, the direction of translational flow will be approximately constant. The rotational component of motion induces a flow pattern which over a local area is approximately constant in both direction and magnitude. Its magnitude and direction are exactly opposite the translational flow of the tracked point. From equations (2) and (3), it is easy to see that at the tracked point $(x, y) = (0, 0)$:
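$$u_t = -\frac{U}{Z}, \qquad v_t = -\frac{V}{Z} \tag{10}$$

$$u_r = -B, \qquad v_r = A \tag{11}$$

where $Z$ here is the range to the tracked point. Since the optical flow is zero at the tracked point, we have

$$-\frac{U}{Z} - B = 0, \quad \text{or} \quad u_t = -u_r \tag{12}$$

$$-\frac{V}{Z} + A = 0, \quad \text{or} \quad v_t = -v_r \tag{13}$$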

The effect on the combined fields is that in the neighborhood of the tracked point, the direction of flow will be approximately constant (modulo 180°), with a magnitude dependent on the difference between the range to the corresponding surface point and the range to the tracked point. Now, consider tracking a point that is moving with respect to the environment. If environmental surface points are visible in the neighborhood of the tracked point and if there is a variation in range to these environmental points, then there will be a variation in direction of flow over the neighborhood.
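The tracking test can be illustrated with a small simulation. In the sketch below (all names hypothetical), the camera rotation $(A, B)$ is chosen to null the flow at the origin as in equations (10)-(13), with $C = 0$. As an assumption of this sketch, a moving target's flow is taken to depend on the camera's translation relative to the target, $(U - U_o, V - V_o)$. The spread of flow directions, computed modulo 180° with doubled angles, is one possible measure of "variation in direction"; the text does not specify the measure used.

```python
import numpy as np


def direction_spread(u, v):
    """Spread of flow directions in degrees, treating directions modulo 180.

    Doubling each angle maps a direction and its opposite to the same value,
    so the circular mean of the doubled angles gives a reference direction.
    """
    doubled = 2.0 * np.arctan2(v, u)
    ref = np.arctan2(np.sin(doubled).mean(), np.cos(doubled).mean())
    dev = np.angle(np.exp(1j * (doubled - ref))) / 2.0  # in (-pi/2, pi/2]
    return np.degrees(dev.max() - dev.min())


def tracking_flow(x, y, Z, T_cam, Z0, obj_velocity):
    """Background flow (equations (1)-(3)) while panning/tilting to track (0, 0).

    T_cam        -- camera translation (U, V, W)
    Z0           -- range to the tracked point
    obj_velocity -- assumed velocity (Uo, Vo, Wo) of the tracked point
    """
    U, V, W = T_cam
    Uo, Vo, _ = obj_velocity
    B = -(U - Uo) / Z0   # cancels the u flow of the tracked point, cf. (12)
    A = (V - Vo) / Z0    # cancels the v flow of the tracked point, cf. (13)
    u = (-U + x * W) / Z + A * x * y - B * (x**2 + 1)
    v = (-V + y * W) / Z + A * (y**2 + 1) - B * x * y
    return u, v


# Background points near the tracked point; depths are kept away from Z0 = 5,
# since points at the tracked range have near-zero flow of unreliable direction.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-0.05, 0.05, n)
y = rng.uniform(-0.05, 0.05, n)
Z = np.where(rng.random(n) < 0.5, rng.uniform(1.5, 3.0, n), rng.uniform(10.0, 30.0, n))
for obj_velocity in [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0)]:
    u, v = tracking_flow(x, y, Z, (1.0, 0.0, 0.0), 5.0, obj_velocity)
    print(obj_velocity, round(float(direction_spread(u, v)), 1))
```

With a stationary target the spread stays well under a degree in this synthetic setup, while a target moving perpendicular to the camera's translation produces a spread of tens of degrees, mirroring the qualitative contrast reported in the examples.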

4.  Examples. 


A set of experiments on moving object detection based on the techniques discussed in the previous sections has been performed on real images. Experimental results are presented in this section for the cases in which 1) the camera rotation is known, 2) objects move in a smooth environment, and 3) a potentially moving object is being actively tracked.

Figure 3 shows the first frame in a sequence of images of an indoor scene. In this example, the camera rotates and translates with respect to the environment while the toy vehicle on the table moves to the right between image frames. The rotational velocity of the camera with respect to the environment was measured. The optical flow field shown in figure 4 was obtained by the token matching technique described in [10]. The translational flow field shown in figure 5 was obtained by subtracting the rotational flow component, computed from the known rotational velocity, from the observed optical flow field (figure 4). The gradient of flow direction in the translational flow field was used to detect the boundaries of moving objects. Figure 6 shows the detected boundary of a moving object overlaid onto the first frame of figure 3.




Figure 4: Optical flow field obtained from the image sequence of figure 3.


Figure 5: Translational flow field determined from the optical flow field of figure 4.


Figure 6: Boundary of a moving object overlaid onto the first image of figure 3.


An image sequence starting with the frame shown in figure 7 is used to illustrate the technique for detecting objects moving in a smooth environment. In this example, the camera moves with respect to an environment consisting of nuts and bolts lying on a planar surface. The optical flow field shown in figure 8 was obtained in the same manner as in figure 4. Figure 9 shows the locations of large variations in optical flow values, corresponding to the boundary of a moving object.


Figure 7: First frame, nuts and bolts sequence.


Figure 8: Optical flow field obtained from the image sequence of figure 7.




Figure 9: Boundary of a moving object overlaid onto the first image of figure 7.


Figure 11: Optical flow field obtained from the image sequence of figure 10.


In figure 10, the circular object in the center of the image is being tracked by the camera while the camera is translating to the right with respect to the environment. Figure 11 shows the estimated optical flow. Figure 12 shows a histogram of the directions of the optical flow. Note that there are two distinct peaks in the histogram. The highest peak corresponds to the optical flow vectors associated with the background, and the second peak corresponds to the optical flow vectors associated with the box and the table in the foreground. The variation in flow direction over the image was computed to be approximately 26°, indicating that the tracked object was in fact moving.


Figure 12: Histogram of the flow directions of the optical flow vectors in figure 11.


Figure  10:  First  frame  of 


As a comparison, a similar experiment in which the tracked object is stationary with respect to the environment while the camera is moving was also performed. A pair of images similar to that of figure 10 was obtained. The resulting estimated optical flow field is shown in figure 13. Its corresponding histogram is shown in figure 14. Note that only one distinct peak is observed in this histogram. The global variation in flow direction in this case was computed to be approximately 14°, which is significantly smaller than that of the previous example.


Figure 13: Optical flow field obtained from tracking an object which is stationary with respect to the environment.


Figure  14:  Histogram  of  the  flow  directions 
of  the  optical  flow  vectors  in  figure  13. 


5.  Discussion. 

The methods described above can be grouped into three classes. Point-based techniques (known motion, known translation) compare individual optical flow vectors against some standard to determine incompatibilities with the motion of the camera relative to the environment. In all cases described here, the compatibility measure is based on a directional constraint associated with the focus of expansion of the translational flow field. Point-based methods have the advantages of computational simplicity and the ability to detect very small moving objects. They will be most effective when parameters of motion are known precisely and the magnitude of the translational flow field at the point in question is sufficiently large to allow an accurate estimate of direction. Edge-based techniques (known rotation, smooth surface) roughly correspond to traditional edge detection. Edge-based motion detection is characterized by the differential flow properties examined and by the filtering technique used to separate edges due to range discontinuities from those due to moving objects. The approach is effective when surfaces are smooth and techniques exist for accurately locating those range discontinuities that do exist. Edge-based methods have the advantage of specifying the outline of moving objects that are detected. They are likely to be of limited use when moving objects are quite small. Region-based techniques (tracked object) examine optical flow values over a region, searching for distributions incompatible with rigid motion. As with edge-based approaches, the viewed region must include portions of both object and environment. As long as it does, the region-based method based on tracking potentially moving objects is an effective test that does not require any information about camera motion, but it does require that there be significant variations in range over the visible portions of the environment.

One region-based technique not discussed above is based on an explicit check for rigidity. Several structure-from-motion algorithms provide an estimate of rigidity [11, 12, 13]. Such checks can presumably be used to recognize non-rigid motion due to the presence of a moving object. Numerical structure-from-motion algorithms have proven to be unsatisfactory in practice due to severe problems with ill-conditioning. It is not yet clear whether or not the test for rigidity can be performed in a sufficiently noise tolerant manner to provide for reliable moving object detection.

No method for detecting moving objects will be effective if it
depends on knowing precise values of optical flow. Techniques for
estimating optical flow are intrinsically noisy (e.g., see [14]).
Additional difficulties arise due to the idealized nature of equations (1)-(3).
Real cameras are not point projection systems. Substantial
effort is required to accurately determine the values of x and y in
(2) and (3). Geometric distortions in the optical and sensing systems
affect measured locations on the image plane. Variabilities in
effective focal length due to focus can be substantial. Reliable techniques
will be based on searching for large magnitude effects in the flow
field [15]. All of the methods described above compare flow
vectors to some predetermined standard, or look for significant
differences across flow boundaries. As a result, all deal with relatively
large magnitude effects, though reliability is dependent on scene
structure, the nature of camera motion, and position in the visual
field relative to the direction of translation.

Many of the techniques described above are based on
comparing flow values at different points within the field of view. All of
these methods require that measurable optical flow exist for points
both in the environment and on moving objects. (Some require only
that the translational flow be measurable.) Such methods share three
important limitations: 1) they are ineffectual near the FOE, 2) the
camera must be moving, and 3) portions of the visible environment
must be sufficiently close to generate recognizably non-zero
translational flow values. Near the FOE, flow due to the environment will
be close to zero, regardless of range. If the camera is not moving,
all environmental flow values will be zero. The same is true if all
points in the environment are very distant relative to the speed of
translation. These limitations do not apply just to the methods
listed above. As illustrated by figure 1, they are general problems
associated with any vision-based motion detection scheme that does
not have accurate information about camera translation and/or range
to visible surface points.
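The first and third limitations follow from the geometry of translational flow. The sketch below is a minimal illustration. It assumes the standard pinhole relation for flow produced by camera translation alone, u = (xW - fU)/Z and v = (yW - fV)/Z, where T = (U, V, W) is the camera translation and Z is depth. That relation, the function names and the unit focal length are assumptions and are not taken from the report's equations (1)-(3), which do not appear in this section. Under this model, for W ≠ 0 the flow is proportional to the distance from the FOE at (fU/W, fV/W) and inversely proportional to depth. It vanishes entirely when the camera does not translate.

```python
import math
from typing import Tuple


def translational_flow(x: float, y: float, depth: float,
                       T: Tuple[float, float, float],
                       f: float = 1.0) -> Tuple[float, float]:
    """Image velocity (u, v) at image point (x, y) caused only by camera translation.

    Assumed pinhole relation (illustrative, not copied from the report):
        u = (x * W - f * U) / Z,   v = (y * W - f * V) / Z
    with camera translation T = (U, V, W) and surface depth Z.
    """
    U, V, W = T
    return ((x * W - f * U) / depth, (y * W - f * V) / depth)


def flow_magnitude(x: float, y: float, depth: float,
                   T: Tuple[float, float, float], f: float = 1.0) -> float:
    """Length of the translational flow vector at (x, y)."""
    u, v = translational_flow(x, y, depth, T, f)
    return math.hypot(u, v)


forward = (0.0, 0.0, 1.0)  # camera moving along the optical axis: FOE at (0, 0)
print(flow_magnitude(0.5, 0.0, 5.0, forward))     # far from the FOE, nearby surface
print(flow_magnitude(0.01, 0.0, 5.0, forward))    # close to the FOE: 50 times smaller
print(flow_magnitude(0.5, 0.0, 1000.0, forward))  # very distant surface: 200 times smaller
print(flow_magnitude(0.5, 0.0, 5.0, (0.0, 0.0, 0.0)))  # stationary camera: 0.0
```

Any rotational component is ignored here. The limitations in the text concern the translational part of the flow, which is the only part that carries range information.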




BIBLIOGRAPHY 


[1] T.J. Broida and R. Chellappa, "Estimation of object motion
parameters from noisy images," IEEE Trans. Pattern Analysis
and Machine Intelligence, January 1986.

[2] A.R. Bruss and B.K.P. Horn, "Passive navigation," Computer
Vision, Graphics, and Image Processing, vol. 21, no. 1, pp. 3-20,
1983.

[3] R.C. Jain, "Segmentation of frame sequences obtained by a
moving observer," IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. PAMI-6, no. 5, pp. 624-629,
September 1984.

[4] G. Adiv, "Inherent ambiguities in recovering 3-D motion and
structure from a noisy flow field," Proc. IEEE Conf. on
Computer Vision and Pattern Recognition, June 1983.

[5] D.H. Ballard and O.A. Kimball, "Rigid body motion from
depth and optical flow," Computer Vision, Graphics, and
Image Processing, vol. 21, pp. 3-20, 1983.

[6] D.A. Marr, Vision, San Francisco: W.H. Freeman and
Company, 1982.

[7] A.M. Waxman and J.H. Duncan, "Binocular image flows,"
Proc. Workshop on Motion: Representation and Analysis,
pp. 31-38, May 1986.

[8] T.S. Huang, S.D. Blostein, A. Werkheiser, M. McDonnell, and
M. Lew, "Motion detection and estimation from stereo image
sequences: Some preliminary experimental results,"
Proceedings Workshop on Motion: Representation and
Analysis, May 1986.

[9] A. Bandopadhay, B. Chandra, and D.H. Ballard, "Active
navigation: Tracking an environmental point considered
beneficial," Proceedings Workshop on Motion: Representation
and Analysis, May 1986.

[10] S.T. Barnard and W.B. Thompson, "Disparity analysis of
images," IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. PAMI-2, pp. 333-340, July 1980.

[11] S. Ullman, "Maximizing rigidity: The incremental recovery
of 3-D structure from rigid and rubbery motion," Perception,
1984.

[12] A. Mitiche, S. Seida, and J.K. Aggarwal, "Determining
position and displacement in space from images," Proc. IEEE
Conf. on Computer Vision and Pattern Recognition, June 1983.

[13] E.C. Hildreth and N.M. Grzywacz, "The incremental recovery
of structure from motion: Position vs. velocity based
formulations," Proceedings Workshop on Motion:
Representation and Analysis, May 1986.

[14] J.K. Kearney and W.B. Thompson, "Gradient based estimation
of disparity," Proc. IEEE Conf. on Pattern Recognition and
Image Processing, June 1982.

[15] "Inexact vision," Proceedings Workshop on Motion:
Representation and Analysis, May 1986.


I  \>G. 


C/v; 


T_AC_TS  C  F 

Cc  ^  T7T  TicrJ/><-  I v 


pt??.loY>cu£J  1  ° 


THE 2½-D SKETCH



William  B.  Thompson  Lincoln  G.  Craton 


Albert  Yonas 


Computer  Science 
Department 


Institute  of 
Child  Development 


Institute  of 
Child  Development 


University  of  Minnesota 
Minneapolis,  MN  55455 


1  Introduction. 


It  has  been  known  for  many  years  that  motion  information  provides  a  cue  for  depth.  Two  rather  distinct 
types  of  information  are  provided.  Relative  motion  of  surface  points  is  an  indication  of  the  relative  depth 
of  the  points.  (In  this  article,  we  will  use  the  term  depth  to  indicate  the  range  from  the  observer  to  visible 
surface  points.)  If  the  surface  points  in  question  are  part  of  the  same  rigid  object,  the  analysis  of  relative 
visual  motion  leads  to  the  structure-from-motion  and  motion-from-structure  algorithms  currently  receiving 
much  attention.  Motion  parallax  also  generates  relative  visual  motion  that  provides  information  about  the 
overall  spatial  layout  of  a  scene.  The  second  motion  cue  to  depth  occurs  at  dynamic  occlusion  boundaries. 
Surfaces on either side of such boundaries are moving visually with respect to one another. Until recently,
it was thought that the depth cue at dynamic occlusion boundaries was due to the appearance (accretion)
or disappearance (deletion) of surface texture as the occluded surface was progressively uncovered or
covered by the occluding surface.

We  have  shown  that  there  is  an  alternate  source  of  information  for  relative  depth  at  dynamic  occlusion 
boundaries.  This  information  comes  from  the  relative  motion  of  the  boundary  itself  with  respect  to  the 
surfaces  on  either  side.  The  investigation  of  this  new  cue  to  depth  at  surface  boundaries  is  an  excellent 
example  of  the  productive  interaction  between  research  in  computational  models  of  vision  and  research 
in  perceptual  psychophysics.  We  start  by  outlining  the  computational  theory  of  determining  depth  at 
boundaries  due  to  motion.  Next,  we  describe  experiments  designed  to  determine  whether  this  cue  is  used 
in human perception. We finish with a number of open questions raised by this research. In particular, we
argue that Marr's 2½-D sketch is inadequate for representing surface boundaries.


2  The  Boundary  Flow  Constraint. 


Visual  motion  can  be  used  to  locate  surface  boundaries  [1].  Edges  in  an  image  due  to  motion  can  arise  from 
far  fewer  causes  than  static  image  cues  such  as  brightness,  color,  and  texture.  In  particular,  a  discontinuity 
in  optical  flow  can  occur  only  because  there  is  a  corresponding  discontinuity  in  depth  and/or  two  separate 
objects  are  moving  with  respect  to  one  another.  Perhaps  even  more  important,  motion  provides  information 


This work was supported by AFOSR contract AFOSR-87-0168, NSF Grant DCR-8500699, and NICHD Grant HD-16934.


and Barrow and Tenenbaum [6]. Marr and Barrow and Tenenbaum suggest a computational architecture
with  a  bottom-up,  linear  data  flow.  Use  of  the  boundary  flow  constraint  requires  that  the  boundary  be  found, 
the  motion  of  the  boundary  determined,  and  the  motion  of  the  surrounding  surfaces  be  determined  prior 
to  the  determination  of  relative  depth.  To  complicate  the  computation  further,  the  boundary  itself  may  be 
signaled  only  by  visual  motion.  The  linear  data  flow  model  imposes  a  predefined  ordering  on  computational 
operations.  It  is  not  clear  what  ordering  could  work  for  boundary  flow  analysis  and  still  perform  adequately 
for  the  many  other  types  of  low-level  computations  that  are  required. 

There is an even more important implication. Marr's 2½-D sketch was proposed, in part, as an
alternative to the purely 2-D segmentation-based representations that were then popular. The 2½-D sketch was
considered advantageous because it provided 3-D information about surfaces while not requiring the global
organization of the image into "objects". The 2½-D sketch shares one critical deficiency with segmentation-based
representations, however: both are two-dimensional representational structures. Edges in these
representations are separations between two regions differing in some visual property. What is missing is any
indication of the asymmetric nature of boundaries: edges corresponding to surface boundaries provide
information about the occluding surface, but not the occluded surface. Thus, we need something like a 2½-D
sketch in which overlapping surfaces can be described.

One  explanation  of  why  the  subjective  contour  displays  are  more  effective  than  the  objective  contour 
displays  is  that  the  particular  subjective  contour  that  was  used  is  a  less  ambiguous  indicator  of  a  depth 
discontinuity  than  is  the  simple  straight  line  which  could  have  arisen  from  many  different  causes.  The 
suggestion is that some image cues suggest the existence of an "unsigned" depth boundary [7]. These cues
indicate that one surface is in front of another, without indicating which of the surfaces is actually nearer.
Cues such as boundary flow can then be used to determine the sign of the depth change. Computational
analysis  of  this  sort  requires  a  representation  of  boundaries  more  sophisticated  than  that  provided  by  current 
models. 
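A small sketch can make the representational point concrete. It shows a boundary record that is "unsigned" (only a depth discontinuity is known) until a second cue supplies its sign. Once signed, the record is asymmetric: it names the occluding region and treats the other region as continuing behind it. The class and function names are hypothetical and are not proposed by the text. How a cue such as boundary flow produces the nearer region is left outside the sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DepthBoundary:
    """Hypothetical boundary record between two image regions.

    `occluding` is None while only an "unsigned" depth discontinuity is known;
    once a signing cue is available it names the nearer (occluding) region.
    """
    left_region: str
    right_region: str
    occluding: Optional[str] = None

    @property
    def occluded(self) -> Optional[str]:
        # The edge belongs to the occluding surface; the other region is taken
        # to continue, unseen, behind it.
        if self.occluding is None:
            return None
        if self.occluding == self.left_region:
            return self.right_region
        return self.left_region


def sign_boundary(b: DepthBoundary, nearer_region: str) -> DepthBoundary:
    """Return a copy of `b` carrying the depth sign supplied by some cue."""
    if nearer_region not in (b.left_region, b.right_region):
        raise ValueError("nearer_region must be one of the two adjacent regions")
    return replace(b, occluding=nearer_region)


edge = DepthBoundary("A", "B")         # unsigned: a depth edge, order unknown
signed = sign_boundary(edge, "B")      # e.g. a boundary flow cue says B is nearer
print(edge.occluded, signed.occluded)  # None A
```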


References 

[1] W.B. Thompson, K.M. Mutch, and V.A. Berzins, "Dynamic occlusion analysis in optical flow fields,"
IEEE Trans. Pattern Analysis and Machine Intelligence, July 1985.

[2]  G.A.  Kaplan,  “Kinetic  disruption  of  optical  texture:  The  perception  of  depth  at  an  edge,”  Perception 
and  Psychophysics,  vol.  6,  pp.  193-198,  1969. 

[3]  A.  Yonas,  L.G.  Craton,  and  W.B.  Thompson,  “Relative  motion:  Kinetic  information  for  the  order  of 
depth  at  an  edge,”  Perception  &  Psychophysics,  vol.  41,  no.  1,  1987. 

[4]  L.G.  Craton  and  A.  Yonas.  “Infants’  sensitivity  to  relative  motion  information  for  depth  at  an  edge,” 
submitted  to  Child  Development. 

[5]  D.A.  Marr,  Vision,  San  Francisco:  W.H.  Freeman  and  Company,  1982. 

[6] H.G. Barrow and J.M. Tenenbaum, "Recovering intrinsic scene characteristics from images," in Computer
Vision Systems, A.R. Hanson and E.M. Riseman, eds., New York: Academic Press, 1978.

[7] J.M. Farber and A.B. McConkie, "Optical motions as information for unsigned depth," Journal of
Experimental Psychology: Human Perception and Performance, vol. 5, no. 3, 1979.




OCCLUSION-SENSITIVE  MATCHING 

Department
University of Minnesota
Minneapolis


Abstract 

Model-based recognition of partially occluded objects is difficult
because of the need to accept matches in which only a
subset of model features correspond to image features. Most
approaches to implementing these partial matches suffer from
serious problems due to ambiguity. Improvements in
performance are possible by directly exploiting evidence for occlusion
in the image. Once a potential match has been hypothesized,
occlusion cues can be used to predict portions of an object model
that are not likely to be visible in the image. We describe both
an algorithm for matching using occlusion cues and a method
for determining the presence of occlusion based only on image
properties. Occluding surfaces are recognized with an approach
that combines motion and contrast information. The method
accurately localizes edges, detects only those edges likely to
correspond to surface boundaries, and provides an indication of which
side of an edge corresponds to the occluding surface.


1  Introduction. 

Many computational models for object recognition depend in
some way on matching two-dimensional object models to image
features. This matching is not limited to template matching
algorithms. Recently, many recognition approaches have been
developed which use three-dimensional part/object models and
sophisticated 3-D matching strategies. Because of the highly
ambiguous nature of the problem, the final stage in such methods
is typically a verification step in which hypothesized information
about identification, position, and orientation is used to project
the model back into the image to be matched against the actual
image features.

Two significant problems plague matching operations. First of
all, image features (lines, corners, holes, etc.) cannot be
determined in a highly reliable manner. Model features are often
missing in the image. Many patterns detected as image features
either do not correspond to actual object properties or are not
contained within the models. Secondly, in complex scenes objects
are often partially occluded. Dealing with occlusion by accepting
partial matches increases computational complexity while
reducing the reliability of the matching process.

This work was supported by AFOSR contract AFOSR-87-0168 and NSF
Grant DCR-8500699.


This paper outlines two methods for improving the reliability of
matching in the presence of partial occlusion. First, we describe
a technique in which visual motion can be combined with static
image cues to improve the effectiveness of the edge detection
process. Our technique recognizes that static and dynamic edge cues
provide different sorts of information about a boundary. Static
cues such as contrast edges give good spatial localization, but are
subject to highly ambiguous interpretations. Visual motion is a
robust indicator of surface boundaries, but does not yield
precise information on the location of the boundary. The approach
given here accurately locates edges due to surface boundaries,
without generating many "false" edges. Even more importantly,
the method gives a direct indication of which side of an edge
corresponds to the occluding surface generating the edge.

The second technique uses information about occlusion to aid in
the matching process. Most existing matching algorithms that
are tolerant of occlusion look for a partial correspondence
between model and image features. If a partial match is found,
unmatched model components are assumed to be hidden by an
occlusion. This approach leads to difficulties because of the chances
of partial matches occurring coincidentally. In our method,
information about occlusion boundaries is used to explicitly
identify model features that will not be visible in the image. Most
of the remaining model features should be findable if the match
is in fact correct. Occluded model features are determined based
directly on image properties at boundaries, rather than just on
the absence of an image feature at some expected location. The
result is a significant decrease in ambiguity.


2 Background.

2.1  Combining  motion  and  contrast  information 
for  edge  detection. 

Segmentation schemes which combine motion and contrast
information date back at least to the work of Jain, Martin, and
Aggarwal [1]. This approach used a difference operator between
two frames to find areas in the image that had changed due to
motion. A segmenter was then run within these areas to
find the boundaries of the moving regions. Thompson used a
region merger approach that grouped pixels into regions based on
similarities in contrast and motion information [2]. Haynes and
Jain developed an edge detector based on a product of the
spatial gradient and a temporal operator [3]. The purpose was to
gain sensitivity to areas signaled by both static and dynamic
effects. More recently, Gamble and Poggio have developed a
Markov Random Field model for recovering optical flow in a
manner that integrated contrast boundaries with visual motion [4].
Their approach constrained discontinuities in flow to occur only
at intensity edges.

Relatively little work has been done on differentiating between
occluding and occluded surfaces without resort to fitting object or
part models. Waltz used constraints associated with line drawing
vertices to identify extremal contours and to determine which side
of such a contour corresponded to an occluding surface [5].
Smitley and Bajcsy identified occluding surfaces in stereo imagery by
comparing correlations between frames for image patches on
either side of a boundary [6]. If the correlations differed
substantially, the boundary was assumed to be due to occlusion and the
region with the highest correlation between views was assumed
to correspond to the occluding surface. Thompson, Mutch, and
Berzins showed how edges in optical flow could be used to
recognize occluding surfaces [7]. Their approach is discussed in more
detail in section 3.


2.2 Matching.

Template matching was one of the first methods proposed for the
visual recognition of objects. Template matching utilizes a
correlation measure between one or more model patterns and images
to be analyzed. Invariance to translation and/or rotation can be
obtained by appropriate scanning of the template pattern over
an image. While useful in some applications, template matching
suffers from problems due to computational complexity and is
unable to deal effectively with the matching of three-dimensional
models to two-dimensional imagery.

Recognition of three-dimensional objects is often done by
using configurations of image features to estimate how a
three-dimensional object is being projected into the two-dimensional
image. The object model is subjected to the appropriate
projection, resulting in a prediction of the object's appearance in
the image. A verification process is used to determine if the
predicted configuration of object features actually appears in the
image (e.g., [8,9,10]). Such methods avoid many of the problems
associated with straightforward template matching.

Recognition of partially occluded objects has been a major
challenge for many years. Most approaches attempt to find good
partial matches between subsets of object models and image
features (e.g., [11,12,13]). Allowing for partial matches increases
the likelihood of false positive classification errors. In addition,
the extraneous configurations of boundaries generated by
overlapping objects cause additional confusion.

Some preliminary attempts have been made to directly
incorporate occlusion information into the matching process. Fisher
developed evidence for extraneous or missing image features based
on boundary topology and other information about the depth
ordering of surfaces [14]. Specialized heuristics were used to
discount the irrelevant mismatches during a verification stage.
Cas'an used the results of a partial matching process to determine
estimates of model features likely to be hidden by occlusions [15].
Evidence for visibility and occlusion came from a presumption
that visible features were spatially adjacent, rather than from
any three-dimensional analysis of the imagery.


3 Motion-based Segmentation.


Thompson, Mutch, and Berzins develop an edge detector for
optical flow fields [7]. One important aspect of this work is that
motion-based edge detection directly yields information about
which side of the edge corresponds to the occluding surface.
This identification is based on a comparison between the
optical flow on either side of the boundary and the visual motion
of the boundary itself. (Aperture effects usually require that all
image flows be projected onto an axis parallel to the normal to
the edge.) The principle underlying the identification of occluded
surfaces is summarized in the boundary flow constraint:

At a surface boundary, the visual motion of the
boundary itself is the same as the visual motion of the
surface generating the boundary.

At a boundary, we need only look at the image-plane motion of
the boundary (the boundary flow) and the optical flow
immediately to either side. Optical flow inconsistent with the boundary
flow corresponds to an occluded surface.
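In code, the constraint reduces to comparing three normal flows at one boundary point. The sketch below is minimal and assumes all flows have already been projected onto the edge normal, as the aperture remark above requires. The function name and the tolerance `tol` are hypothetical. The sketch also adds a "no decision" outcome for cases where the two sides move alike or where the boundary agrees with neither side.

```python
def occluding_side(flow_left: float, flow_right: float, boundary_flow: float,
                   tol: float = 0.25) -> str:
    """Apply the boundary flow constraint at a single boundary point.

    All three flows are assumed to be already projected onto the edge normal
    (e.g. in pixels per frame). Returns "left", "right", or "undetermined".
    """
    if abs(flow_left - flow_right) <= tol:
        # Surfaces move alike along the normal: nothing to discriminate.
        return "undetermined"
    err_left = abs(flow_left - boundary_flow)
    err_right = abs(flow_right - boundary_flow)
    if min(err_left, err_right) > tol:
        # The boundary agrees with neither surface: treat as unreliable.
        return "undetermined"
    # The surface whose flow matches the boundary flow is the occluding one;
    # the other side's flow is inconsistent and marks the occluded surface.
    return "left" if err_left < err_right else "right"


print(occluding_side(2.0, -1.0, 1.9))   # left
print(occluding_side(2.0, -1.0, -1.0))  # right
```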

One problem with exploiting the boundary flow constraint is the
apparent need to determine the actual motion of the boundary.
In many circumstances, this can result in a difficult
correspondence problem. Thompson, Mutch, and Berzins [7] demonstrated
how the motion of optical flow edges can be related to the boundary
flow constraint in a manner that does not explicitly compute boundary
motion. In that work, the boundary that was moving was itself
indicated by a motion cue. Here, we extend the result to show how
any zero-crossing style edge operator can be easily used to distinguish
between occluding and occluded surfaces. As shown in [7], with
an appropriate change of coordinate systems it is sufficient to
consider only two cases. In one, two surfaces are moving towards
one another with equal but opposite optical flows. In the
second case, the surfaces are moving away from one another with
equal but opposite flows. Over time, the Laplacian pattern at the
boundary will move with the surface to which it is attached. If a
zero-crossing edge detector is applied to an optical flow pattern,
all that is necessary to classify the edge is to observe the sign of
the Laplacian pattern as it translates.
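The flow-edge case can be checked in one dimension. The sketch below makes three assumptions. First, flows are already projected onto the edge normal and normalized so the two surfaces move with equal and opposite flows. Second, the flow edge is modeled as a smoothed step that moves with the occluding surface. Third, the direction of edge motion is read from the sign of a standard discrete Laplacian at the frame-0 zero crossing. Under this sign convention, a positive value means the edge moved along the gradient of the flow profile. An operator of opposite sign, such as a narrow-minus-wide difference of Gaussians, reverses that reading, so the rule must be matched to the operator actually used. All function names are hypothetical.

```python
import numpy as np


def smoothed_step(x, edge, left, right, width=2.0):
    """1-D flow profile that changes smoothly from `left` to `right` at `edge`."""
    return left + (right - left) * 0.5 * (1.0 + np.tanh((x - edge) / width))


def second_difference(f, i):
    """Standard discrete 1-D Laplacian: f[i-1] - 2*f[i] + f[i+1]."""
    return float(f[i - 1] - 2.0 * f[i] + f[i + 1])


def occluding_side_from_laplacian(flow_next, i0, flow_left, flow_right, eps=1e-9):
    """Classify the flow edge whose Laplacian zero crossing was at index i0 in
    the previous frame, using only the sign of the Laplacian there now.

    Assumes normalized flows: the two surfaces move with equal and opposite
    normal flow (flow_left == -flow_right, nonzero).
    """
    if flow_left == flow_right:
        return "undetermined"                   # no flow discontinuity
    s = second_difference(flow_next, i0)
    if abs(s) < eps:
        return "undetermined"                   # edge did not move detectably
    grad_dir = np.sign(flow_right - flow_left)  # +1: flow gradient points right
    # With this sign convention, s > 0 means the edge moved along the gradient.
    edge_dir = grad_dir * np.sign(s)
    # The edge moves with the occluding surface; after normalization that is
    # the side whose flow has the same sign as the edge motion.
    return "left" if np.sign(flow_left) == edge_dir else "right"


x = np.arange(64, dtype=float)
i0 = 32  # Laplacian zero crossing of the flow edge in frame 0
# Approaching surfaces: left moves +1 px/frame, right moves -1 px/frame.
# If the left surface occludes, the flow edge moves right with it.
frame1 = smoothed_step(x, edge=i0 + 1, left=1.0, right=-1.0)
print(occluding_side_from_laplacian(frame1, i0, 1.0, -1.0))  # left
```

The sketch never estimates the boundary's displacement explicitly. Only the sign of the Laplacian at the old zero crossing is used, which is the point of the method described above.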

The situation is somewhat more complicated if edges are
signaled by some feature other than optical flow. In such cases, it
is necessary to consider both the contrast orientation of the edge
and the pattern of motion to either side. The sign of the
Laplacian function can be used to determine the direction of boundary
movement relative to the direction of the gradient at the
boundary. If we observe the value of the Laplacian at the zero
crossing and that value goes negative, then we know that the edge
has moved in the direction of the gradient. If the value of the
Laplacian goes positive, then the edge motion is in the direction
opposite to the gradient. It is still necessary to compare edge
motions and surface motions. Again using the coordinate system
transform, we need only determine whether the two surfaces are
moving towards or away from each other. It is not necessary to
quantitatively estimate actual surface and boundary flows.

The following algorithm implements this process:

1. Find an edge point, $x_0$, in frame $f_0$. Compute the gradient
$\nabla G(x_0)$, where $G(\cdot)$ is any perceivable function of the image
that corresponds to surface properties.

2. Project all optical flow values onto an axis parallel to $\nabla G(x_0)$.

3. Normalize coordinates by locating an evaluation point
$x_1 = x_0 - \bar{f}$ in frame $f_1$, where $\bar{f}$ is the average
inter-frame flow in the neighborhood of $x_0$.

4. The direction of $\nabla G(x_0)$ points towards the side of the
boundary corresponding to the occluding surface if $\nabla^2 G(x_1)$ is
negative and the two surfaces are approaching one another,
or if $\nabla^2 G(x_1)$ is positive and the two surfaces are separating.

5. The direction of $\nabla G(x_0)$ points towards the side of the
boundary opposite the occluding surface if $\nabla^2 G(x_1)$ is positive
and the two surfaces are approaching one another, or if
$\nabla^2 G(x_1)$ is negative and the two surfaces are separating.

Note that if surface motion is parallel to the boundary, no
determination of occluding and occluded surfaces is made. In fact, in
this situation no definitive determination is possible based only
on visual motion.

One advantage of this particular algorithm is that it directly
provides a mechanism for combining motion-based boundary
detection with static edge cues. Discontinuities in optical flow can
only occur due to discontinuities in depth and/or due to two
surfaces moving relative to one another. Thus, flow edges can
arise from far fewer causes than edges due to changes in intensity,
texture, color, etc. Unfortunately, flow edges are difficult to
localize precisely. The above algorithm can be used to filter out
all static edges that are not associated with a change in optical
flow over the neighborhood of the edge. The effect is to use
motion to reduce ambiguity, while using the static cues to preserve
localization. In our current algorithm, we are only interested in
boundary points at which we can differentiate between occluding
and occluded surfaces. As a result, we delete all edge elements
that do not have some differential optical flow along an axis
perpendicular to the edge. This is easily done by modifying the
above algorithm as follows:

3b. If the magnitude of $\nabla^2 G(x_1)$ is close to zero, delete the
edge element at $x_0$ from further consideration.

Only a bit more complexity is required in order to recognize edges
with differential motion only tangential to the edge orientation.
Such edges signal surface boundaries, but it is not possible to
distinguish between the occluding and occluded sides.


4 Occlusion-Sensitive Matching.

We have developed a simple model of how occlusion information
might be used to aid in recognition. The model uses occlusion
cues arising from the boundary flow constraint to reduce
ambiguity in template matching applied to partially occluded objects.
In presenting this model, our aim is to demonstrate the utility of
incorporating occlusion information directly into the recognition
process. The specifics of the algorithm are for purposes of
illustration only. The approach will work for verification as well as
for template matching. Any occlusion cue can be used; the
method is not limited to using just motion information. More
efficient and reliable implementations are possible. The basic
principles of our approach can be summarized as follows:

• Determine a matching "score" based on searching for model
features in the image.

• Introduce penalties for model features not in the image, but
only if there is no evidence for the features being hidden
by an occlusion.

• Do not introduce penalties for image features not accounted
for in the model.

In the examples presented below, we define the matching score to be the proportion of model features found in the image. This is done by computing the ratio of matched model features to potentially matchable model features. The features used in our simple example are silhouette edge elements. Only image edges with differential motion across the edge are used. A small distance tolerance is allowed to accommodate noise and other distortions. Information about occluding edges in the image is used in two ways. First, the model/non-model sides of the template edge must be compatible with the occluding/occluded sides of the image edges. (Note that this is a stronger requirement than just orientational compatibility.) Second, a model edge element is considered potentially matchable if it is not masked. When a model is being matched at a particular image location, masking occurs if there are significant occlusion edges in the image within the interior of the model. Masking regions are grown outward from the occluding side of any interior image edges. To ensure that a masking region will not extend beyond any occluding surface, it ends at the first image edge reached. In our current implementation, matching is first done without using the masking operation. Areas of partial match are then reevaluated using the masking procedure.
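The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments. The following are all hypothetical: the `EdgeElement` type, the convention that normals point toward the model side (for templates) or the occluding side (for image edges), the tolerance `tol`, and the precomputed `masked` set of model-edge indices. The region-growing step that would produce the mask is not shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeElement:
    """Edge element at (x, y) with unit normal (nx, ny).

    Assumed convention: for a model (template) edge the normal points toward
    the model side; for an image motion/contrast edge it points toward the
    occluding side.
    """
    x: float
    y: float
    nx: float
    ny: float


def is_matched(model_edge, image_edges, offset, tol):
    """True if an image edge lies within `tol` of the translated model edge
    and its occluding side agrees with the model side.

    Requiring the normals to point the same way (positive dot product) is
    stricter than orientational compatibility, which would accept either sign.
    """
    mx = model_edge.x + offset[0]
    my = model_edge.y + offset[1]
    for e in image_edges:
        close = math.hypot(e.x - mx, e.y - my) <= tol
        same_side = model_edge.nx * e.nx + model_edge.ny * e.ny > 0.0
        if close and same_side:
            return True
    return False


def matching_score(model_edges, image_edges, offset, tol=1.0, masked=frozenset()):
    """Ratio of matched model edges to potentially matchable model edges.

    `masked` holds indices of model edges judged hidden by an occluding
    surface; they are dropped from the denominator, so occluded features
    are not penalized. Unmatched image edges never lower the score.
    """
    matchable = [m for i, m in enumerate(model_edges) if i not in masked]
    if not matchable:
        return 0.0
    found = sum(is_matched(m, image_edges, offset, tol) for m in matchable)
    return found / len(matchable)


# Usage: a straight model edge of four elements, of which only the first
# two are visible in the image (the rest are assumed to be behind a wall).
model = [EdgeElement(0.0, float(k), 1.0, 0.0) for k in range(4)]
image = [EdgeElement(10.0, 0.0, 1.0, 0.0), EdgeElement(10.0, 1.0, 1.0, 0.0)]
print(matching_score(model, image, (10.0, 0.0), tol=0.5))                 # 0.5
print(matching_score(model, image, (10.0, 0.0), tol=0.5, masked={2, 3}))  # 1.0
```

Masked elements are removed from the denominator rather than counted as misses. That choice keeps a partially occluded model from being penalized.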

A set of simple examples was created to test our approach to occlusion-sensitive matching. We used artificially created objects to better control for ambiguity in matching. However, the examples all involve real imagery and automatically determined optical flow. Figure 1 shows a set of fourteen object models. Two actual objects were used, one T-shaped, the other L-shaped. Figure 2 shows one frame from a sequence in which the T is moving behind a wall to the right. The wall is partially occluding the T. As a result, simple template matching may not be effective for recognition. Figure 3 shows contrast edges in the T sequence. The edges were determined using a large-kernel zero-crossing operator. Figure 4 shows motion/contrast edges determined by deleting edge elements in figure 3 that are not associated with differential optical flow across the edge. Figure 5 shows the position of the T model in the image resulting in the highest matching score. Figure 6 shows the unmatched edges within the T model when applied to the image at the location shown in figure 5. The hash marks along the edges point to the occluding surface, as indicated by the boundary flow constraint. Finally, figure 7 shows the portions of the T model which have been masked as a result of the internal edges shown in figure 6.

Figure 3: T contrast edges.

Figure 7: Masked portions of T model.
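The step that deletes contrast edge elements lacking differential optical flow across them might look like the following sketch. Several assumptions apply. Each edge element carries a unit normal. `flow_at` is a hypothetical callable that returns an optical flow estimate at any image point. `offset` and `threshold` are tuning values chosen for illustration. The sketch does not reproduce the zero-crossing operator or the flow estimator used in the experiments.

```python
import math


def flow_jump(edge, flow_at, offset):
    """Magnitude of the optical flow difference across an edge element.

    `edge` is (x, y, nx, ny) with a unit normal; flow is sampled `offset`
    pixels away on each side along that normal.
    """
    x, y, nx, ny = edge
    u1, v1 = flow_at(x + offset * nx, y + offset * ny)
    u2, v2 = flow_at(x - offset * nx, y - offset * ny)
    return math.hypot(u1 - u2, v1 - v2)


def motion_contrast_edges(edges, flow_at, offset=2.0, threshold=0.5):
    """Keep only contrast edge elements with differential flow across them."""
    return [e for e in edges if flow_jump(e, flow_at, offset) > threshold]


def flow_at(x, y):
    """Toy flow field: a surface left of x = 5 moves with flow (1, 0)
    in front of a stationary background."""
    return (1.0, 0.0) if x < 5.0 else (0.0, 0.0)


edges = [(5.0, 0.0, 1.0, 0.0), (2.0, 0.0, 1.0, 0.0)]
print(motion_contrast_edges(edges, flow_at))  # [(5.0, 0.0, 1.0, 0.0)]
```

The edge at x = 5 sits on the motion boundary and survives. The edge at x = 2 lies inside a uniformly moving region, so it is treated as a surface marking and dropped.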


Table 1 shows the matching scores for all model types evaluated against the T and L sequences. The highest scores in each column have been italicized. The models are matched against the raw contrast edges, the motion/contrast edges, the motion/contrast edges using the model/non-model orientational compatibility constraint, and finally using all of the matching constraints described above (differential motion, model/non-model edge orientation, and masking). The data, while currently limited to a few test cases, suggest that using occlusion information can reduce ambiguity in matching. Using all of the available matching constraints, both examples are correctly classified. Using either traditional template matching or only a subset of the matching constraints causes one or both of the images to be misclassified.


Figure 4: T motion/contrast edges.


5 Summary.

Edge detection is possible based on both contrast and motion information. Contrast edges can arise from a large number of causes and thus are difficult to interpret accurately. Motion edges are always associated with depth and/or surface boundaries, but are difficult to localize precisely. The motion-based segmentation technique described above combines motion and contrast cues in an integrated edge detection process. Localization is based on contrast edges, while motion information is used to filter out edges not likely to correspond to surface boundaries. The method further gives a direct indication of the side of the boundary corresponding to the occluded surface.
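The side assignment rests on the boundary flow constraint: near an occlusion boundary, the boundary moves with the occluding surface. The sketch below is minimal and assumes the edge's own normal speed has already been measured, for example by following the edge between frames. It compares only flow components along the edge normal, because the component parallel to a straight edge is subject to the aperture problem. The function name and argument conventions are hypothetical.

```python
def occluding_side(normal, flow_plus, flow_minus, boundary_speed):
    """Label the occluding side of a boundary via the boundary flow constraint.

    normal:          unit normal (nx, ny) of the edge; '+' is the side it points to.
    flow_plus/minus: optical flow (u, v) measured just off each side of the edge.
    boundary_speed:  measured image speed of the edge along `normal`.

    The boundary moves with the occluding surface, so the side whose normal
    flow component best agrees with the boundary's speed is taken as occluding.
    """
    nx, ny = normal
    s_plus = flow_plus[0] * nx + flow_plus[1] * ny
    s_minus = flow_minus[0] * nx + flow_minus[1] * ny
    d_plus = abs(s_plus - boundary_speed)
    d_minus = abs(s_minus - boundary_speed)
    if d_plus == d_minus:
        return None  # no evidence either way
    return "+" if d_plus < d_minus else "-"


print(occluding_side((1.0, 0.0), (1.0, 0.3), (0.0, 0.0), 1.0))  # +
print(occluding_side((1.0, 0.0), (1.0, 0.3), (0.0, 0.0), 0.0))  # -
```

The other side of the boundary is then the occluded surface.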

Identification of occluded and occluding surfaces can significantly aid in recognition tasks. We have presented a simple matching algorithm in which the presence of occlusion boundaries is used to avoid penalizing matches in situations where model features are hidden from view by other objects. While our algorithm has been described within the context of template matching, it is equally appropriate when verifying hypothesized matches suggested by more complex three-dimensional reasoning processes.


References 

[1] R. Jain, W.N. Martin, and J.K. Aggarwal, "Segmentation through the detection of changes due to motion", Computer Graphics and Image Processing, 11:13-34, 1979.

[2] W.B. Thompson, "Combining motion and contrast for segmentation", IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-2:543-549, November 1980.

[3] S.M. Haynes and R. Jain, "Detection of moving edges", Computer Vision, Graphics and Image Processing, 21:345-367, March 1983.

[4] E. Gamble and T. Poggio, Visual Integration and Detection of Discontinuities: The Key Role of Intensity Edges, AI Memo 970, MIT, 1987.

[5] D. Waltz, "Understanding line drawings of scenes with shadows", in P.H. Winston, editor, The Psychology of Computer Vision, McGraw-Hill, New York, 1975.

[6] D.L. Smitley and R. Bajcsy, "Stereo processing of aerial, urban images", Proc. Seventh Int. Conference on Pattern Recognition, 433-435, 1984.

[7] W.B. Thompson, K.M. Mutch, and V.A. Berzins, "Dynamic occlusion analysis in optical flow fields", IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-7:374-383, July 1985.

[8] L.G. Roberts, "Machine perception of three-dimensional solids", in J.T. Tippett et al., editors, Optical and Electro-Optical Information Processing, MIT Press, Cambridge, MA, 1965.

[9] W.E.L. Grimson, "Recognition of object families using parameterized models", Proc. First International Conference on Computer Vision, 93-101, 1987.

[10] D.P. Huttenlocher and S. Ullman, "Object recognition using alignment", Proc. First International Conference on Computer Vision, 102-111, 1987.

[11] W.A. Perkins, "Model-based vision system for scenes containing multiple parts", Proc. Fifth International Joint Conference on Artificial Intelligence, 678-684, 1977.

[12] J.W. McKee and J.K. Aggarwal, "Computer recognition of partial views of curved objects", IEEE Trans. on Computers, C-26:790-800, 1977.

[13] R.C. Bolles, "Robust feature matching through maximal cliques", Proc. SPIE Technical Symposium on Imaging Applications for Automated Industrial Inspection and Assembly, 1979.

[14] R. Fisher, "Using surfaces and object models to recognize partially occluded objects", Proc. Eighth International Joint Conference on Artificial Intelligence, 989-995, 1983.

[15] S. Castan, J. Shen, and N.Q. He, "A method for recognition and positioning of partially observed objects", Proc. Eighth Int. Conference on Pattern Recognition, 1986.


                          T image sequence                          L image sequence
                          contrast  motion/   model/     occlusion  contrast  motion/   model/     occlusion
Model                     edges     contrast  non-model  masking    edges     contrast  non-model  masking
M1 (L)                    .635      .422      .345       .439       .?47      .594      .550       .?37
M2 (Cross)                .659      .562      .511       .542       .754      .517      .448       .454
M3 (Square)               .642      .218      .215       .296       .512      .260      .192       .192
M4 (Asymmetric triangle)  .628      .320      .456       .603       .416      .321      .257       .27?
M5 (Quadrilateral)        .652      .?80      .295       .526       .?6?      .377      .338       .338
M6 (Rectangle)            .800      .348      .388       .638       .704      .504      .356       .156
M7 (T)                    .670      .532      .494       .567       .543      .127      .236       ?
M8 (Narrow triangle)      .665      .543      .320       .520       .715      .498      .412       .412
M9 (Inverted triangle)    .769      .474      .446       .475       .713      .478      .430       ?
M10 (Narrow diamond)      .621      .57?      .566       .606       .797      .6??      .571       .571
M11 (Standard diamond)    .583      .456      .406       .437       .772      .594      .556       .359
M12 (Broad triangle)      .563      .340      .398       .525       .716      .425      .372       .380
M13 (Tilted trapezoid)    .635      .450      .375       .551       .745      .625      ?          ?
M14 (Tilted rectangle)    .574      .426      .413       .139       .702      .603      .554       .561


Table  1:  Matching  scores  -  all  models  applied  to  T  and  L  sequences 


Structure-From-Motion  By 
Tracking  Occlusion  Boundaries 


William  B.  Thompson 

Computer  Science  Department 
University  of  Minnesota 
Minneapolis,  MN  55455 


To appear in the Proceedings of the IEEE Workshop on Visual Motion, 1989.


Abstract 

Active visual tracking of points on occlusion boundaries can simplify certain computations involved in determining scene structure and dynamics based on visual motion.
Two  such  techniques  are  described  here.  The  first  provides  a  measure  of  ordinal  depth 
by  distinguishing  between  occluding  and  occluded  surfaces  at  a  surface  boundary.  The 
second  can  be  used  to  determine  the  direction  of  observer  motion  through  a  scene. 


1  Introduction. 


The study of computational models of active vision has seen a flurry of recent activity (e.g., [1,2,3]). These and similar papers have investigated ways in which the visual process can be simplified and/or extended if active control is available over camera motion. Much of this work has dealt specifically with the issue of eye/camera rotation [2,3]. The ability to visually track environmental points can lead to significant simplifications in computing visual properties. This note describes two such simplifications, both involving the tracking of edge points corresponding to occlusion boundaries. The first technique determines local depth orderings by recognizing which side of a boundary corresponds to an occluding surface. The second technique estimates the direction of observer motion more simply than most previously proposed approaches.

The methods described below are most effective when the following three assumptions hold:

• An observer is moving through an environment in which at most a relatively small portion of the visual field corresponds to moving objects.

• Occlusion boundaries involving significant changes in depth commonly occur.

• The observer is able to keep a selected edge element centered in the field of view.

This last assumption is at least plausible in most natural situations where boundaries are not straight and/or surfaces are visually textured. Analysis will be based on optical flow in the image near the tracked edge element. Note that in biological terms, this corresponds to retinal flow, not the Gibsonian idea of flow in the “optic array”.

This work was supported by AFOSR contract AFOSR-87-0168 and NSF Grants DCR-8500899 and IRI-8722576.


2  Analysis. 


Figure  1:  Optical  flow  near  a  surface  boundary. 

Visual motion depends on the instantaneous translational velocity of the eye/camera, the range to surface points in the scene, and the rotational velocity needed to track a particular scene point. Figure 1 illustrates the situation in the neighborhood of a boundary when no rotation is occurring. $S_n$ corresponds to a near surface, which has associated optical flow $f_n$. $S_n$ is occluding a more distant surface $S_d$, with associated flow $f_d$. The boundary itself moves in the image with flow $f_b$. From [4], we know that close to an occlusion boundary the visual motion of the occluding surface and the visual motion of the boundary are the same. Thus, $f_b = f_n$. Figure 2 describes the situation when the edge is being accurately tracked. Tracking is effected by introducing an eye/camera rotation of velocity $\omega = (A, B, C)^T$ which exactly compensates for $f_b$. This also has the effect of nulling out $f_n$. The only visible flow left, $f'_d = f_d - f_b$, is associated with the more distant surface.

Figure 2: Optical flow with edge tracking.


A  simple  set  of  equations  defines  the  relationship  between  optical  flow,  motion,  and  scene  structure 
[5].  Using  a  planar  imaging  system,  perspective  projection,  and  a  coordinate  system  centered  at 
the  camera  with  z  axis  along  the  line  of  sight: 


$$u = u_t + u_r, \qquad v = v_t + v_r \qquad (1)$$

where $u$ and $v$ are the $x$ and $y$ components of flow, $z$ is the distance to the surface point imaged at $(x, y)$, translational velocity is $T = (U, V, W)^T$, and

$$u_t = \frac{-U + xW}{z}, \qquad v_t = \frac{-V + yW}{z} \qquad (2)$$

$$u_r = Axy - B(x^2 + 1), \qquad v_r = A(y^2 + 1) - Bxy \qquad (3)$$
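Equations (1)-(3) translate directly into a small function, which is handy for checking the limits that follow. This is a transcription sketch under two assumptions: image coordinates are in units of the focal length, as the equations imply, and, like equation (3), there is no rotation about the line of sight. `image_flow` is a hypothetical name.

```python
def image_flow(x, y, z, translation, rotation):
    """Optical flow (u, v) at image point (x, y), per equations (1)-(3).

    translation = (U, V, W); rotation = (A, B). As in equation (3), no
    rotation about the line of sight is included.
    """
    U, V, W = translation
    A, B = rotation
    u_t = (-U + x * W) / z                 # equation (2)
    v_t = (-V + y * W) / z
    u_r = A * x * y - B * (x * x + 1.0)    # equation (3)
    v_r = A * (y * y + 1.0) - B * x * y
    return (u_t + u_r, v_t + v_r)          # equation (1)


# At the image center this reduces to (-U/z - B, -V/z + A), i.e. (-0.6, -0.05) here.
print(image_flow(0.0, 0.0, 2.0, (1.0, 0.5, 3.0), (0.2, 0.1)))
```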

The  optical  flow  equations  simplify  considerably  at  the  center  of  the  field  of  view: 

$$\lim_{x,y \to 0} u_t = \frac{-U}{z}, \qquad \lim_{x,y \to 0} v_t = \frac{-V}{z} \qquad (4)$$

$$\lim_{x,y \to 0} u_r = -B, \qquad \lim_{x,y \to 0} v_r = A \qquad (5)$$

If the tracked boundary element is centered within the field of view and if surface flow is measured near this center, then $f_b$, $f_n$, and $f_d$ are all determined by equations 4-5.


Utilizing the fact that $f_b = f_n$, we can now compute $f'_d$:

$$\begin{aligned}
f'_d &= f_d - f_b = f_d - f_n \\
&= \left( \frac{-U}{z_d} - B + \frac{U}{z_n} + B,\; \frac{-V}{z_d} + A + \frac{V}{z_n} - A \right) \\
&= \left( \left( \frac{1}{z_n} - \frac{1}{z_d} \right) U,\; \left( \frac{1}{z_n} - \frac{1}{z_d} \right) V \right) \\
&= (\alpha U, \alpha V), \qquad \alpha > 0
\end{aligned}$$

where $\alpha = 1/z_n - 1/z_d$ is positive because $z_n < z_d$. $f'_d$ is thus a scaled version of the projection of the translation vector onto the image plane.

We can now summarize the two algorithms for analyzing visual motion using edge tracking:

• Identification of occluding surface.

When a boundary element is visually tracked, the region to the side of the boundary corresponding to the occluding surface will have near-zero image flow. The region to the side of the boundary corresponding to the occluded surface will in general be associated with significant visual motion.


•  Determination  of  direction  of  observer  motion. 

When  a  boundary  element  is  visually  tracked,  optical  flow  due  to  the  more  distant  surface 
indicates  the  direction  of  observer  motion.  The  flow  vectors  point  in  the  direction  of  the 
image  location  corresponding  to  the  line  of  sight  coincident  with  the  direction  of  translational 
motion.  (This  location  is  commonly  called  the  “focus  of  expansion”,  but  the  term  is  only 
strictly  correct  for  purely  translational  motion.)  Multiple  fixations  over  the  field  of  view  can 
be  used  to  solve  for  the  actual  direction  of  translation. 
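Both procedures can be simulated at the image center using equations (4)-(5). The sketch below is illustrative only. The depths, the translation, and the stillness threshold `eps` are made-up values, the function names are hypothetical, and a real system would measure the flows rather than compute them. The sketch nulls the boundary flow with a rotation, checks that the near side becomes still, and reads the image-plane direction of translation from the residual far-surface flow.

```python
import math


def center_flow(translation, z, rotation):
    """Flow at the image center, from equations (4)-(5)."""
    U, V, _ = translation
    A, B = rotation
    return (-U / z - B, -V / z + A)


def tracking_rotation(boundary_flow):
    """Rotation (A, B) whose center flow (-B, A) cancels the boundary flow."""
    fu, fv = boundary_flow
    return (-fv, fu)


def occluding_side_tracked(residual_plus, residual_minus, eps):
    """The side with near-zero residual flow is occluding; None if unclear."""
    still_plus = math.hypot(*residual_plus) < eps
    still_minus = math.hypot(*residual_minus) < eps
    if still_plus and not still_minus:
        return "+"
    if still_minus and not still_plus:
        return "-"
    return None


def heading_direction(residual_far):
    """Image-plane direction (radians) of (U, V), since f'_d = alpha * (U, V)."""
    return math.atan2(residual_far[1], residual_far[0])


# Simulated fixation: near surface (z_n = 2) on the '+' side, far surface at z_d = 8.
T = (1.0, 0.5, 2.0)
z_n, z_d = 2.0, 8.0
f_b = center_flow(T, z_n, (0.0, 0.0))   # boundary moves with the near surface
omega = tracking_rotation(f_b)
res_near = center_flow(T, z_n, omega)   # nulled by tracking
res_far = center_flow(T, z_d, omega)    # alpha * (U, V), alpha = 1/z_n - 1/z_d
print(res_near, res_far)                # (0.0, 0.0) (0.375, 0.1875)
print(occluding_side_tracked(res_near, res_far, eps=0.05))  # +
```

Here $\alpha = 1/2 - 1/8 = 0.375$, so the residual far-surface flow is exactly $0.375\,(U, V)$. Combining several fixations to recover the full direction of translation is not shown.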


3  Discussion. 


Both algorithms offer significant computational simplifications over alternative approaches. The few
previously  reported  optical  flow  based  techniques  for  differentiating  between  occluding  and  occluded 
surfaces  require  reasonably  accurate  flow  estimates  on  either  side  of  the  boundary  [4,6].  The  method 
reported  here  only  requires  that  regions  of  significant  image  motion  be  recognized.  It  is  far  easier  to 
determine  that  image  motion  is  occurring  than  it  is  to  estimate  the  specific  characteristics  of  that 
motion.  When  eye/camera  rotations  are  possible,  the  determination  of  observer  motion  is  difficult 
because  of  the  complex  manner  in  which  translational  and  rotational  motion  interact  to  generate 
an  optical  flow  field  (see  [5]).  Edge  tracking  eliminates  the  complexity  associated  with  rotation. 

It  is  important  to  note  that  eye  tracking  does  not  reduce  the  conceptual  difficulties  associated 
with  these  two  tasks.  Eye  tracking  provides  neither  additional  constraints  nor  other  sorts  of  new 
information.  This  is  easily  seen  by  recognizing  that  all  of  the  information  in  the  tracking  image  is 
available  in  an  image  of  the  same  scene  without  tracking.  Tracking  is  accomplished  by  generating 
a  rotation  of  the  eye/camera  system  based  on  estimates  of  image  drift  such  as  optical  flow  at 
the  image  center.  Once  this  rotational  velocity  is  determined,  a  non-tracking  image  sequence  can 
trivially  be  converted  into  the  equivalent  tracking  sequence  using  equation  3.  In  fact,  both  of  the 
algorithms  described  above  are  really  special  cases  of  methods  already  presented  in  the  literature. 
Occlusion  analysis  is  described  in  [4].  The  method  for  determining  direction  of  motion  is  essentially 
equivalent to that described in [7]. What differs is the simplification of the actual algorithms, not the underlying computational theory.
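The observation that a non-tracking sequence can be converted into the equivalent tracking sequence using equation 3 can be sketched as follows. The sketch makes three assumptions. Flow has been measured at a set of sample points that includes the image center. The tracked edge sits at the center. The required rotation is taken from the image drift there. `rotational_flow` and `to_tracking_sequence` are hypothetical names.

```python
def rotational_flow(x, y, A, B):
    """Rotational flow of equation (3) at image point (x, y)."""
    return (A * x * y - B * (x * x + 1.0), A * (y * y + 1.0) - B * x * y)


def to_tracking_sequence(samples):
    """Convert non-tracking flow samples into the equivalent tracking flow.

    `samples` maps image points (x, y) to measured flow (u, v) and must
    include the image center (0, 0), where the tracked edge is assumed to be.
    """
    du, dv = samples[(0.0, 0.0)]  # image drift at the center
    A, B = -dv, du                # rotation whose center flow (-B, A) cancels it
    out = {}
    for (x, y), (u, v) in samples.items():
        ur, vr = rotational_flow(x, y, A, B)
        out[(x, y)] = (u + ur, v + vr)
    return out


# Usage: a frontal plane at depth z = 2 with sideways translation T = (1, 0, 0)
# gives flow (-0.5, 0) everywhere (equation 2). After conversion the center is still.
samples = {(0.0, 0.0): (-0.5, 0.0), (0.2, 0.1): (-0.5, 0.0)}
print(to_tracking_sequence(samples))
```

No new information enters the conversion: the rotation is computed from the image itself, which is the point the paragraph above makes.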

The  effectiveness  of  these  two  algorithms  is  limited  by  the  accuracy  with  which  boundaries  can  be 
tracked  and  by  the  visual  texture  present  adjacent  to  the  boundaries.  While  biological  systems 
are  capable  of  tracking  environmental  points  with  relatively  high  precision,  the  computer  vision 
community  has  only  recently  begun  to  study  the  engineering  difficulties  involved  in  tracking  features 
in  complex  scenes.  Aperture  effects  are  a  further  consideration.  It  is  generally  felt  that  only  the 
component  of  motion  perpendicular  to  an  edge  can  be  determined.  This  is  actually  only  true  if  the 
edge  does  not  curve  (e.g.,  see  [8]).  Reasonably  reliable  two-dimensional  tracking  should  be  possible 
for  most  realistic  scenes,  though  sufficient  experimentation  has  not  yet  been  done.  Both  algorithms 
depend  on  recognizing  aspects  of  image  motion  in  the  neighborhood  of  the  tracked  edge.  This  is 
most  easily  accomplished  if  both  surfaces  are  visually  textured.  This  will  hold  in  many  but  not  all 
scenes. We do know that human vision is capable of “filling in” the motion of homogeneous portions of surfaces. We do not yet have good computational models of how this is done, however.

Open  questions  remain  as  to  whether  or  not  biological  vision  systems  actually  use  methods  of  this 
sort  to  simplify  the  determination  of  scene  structure  and  motion  trajectories.  To  answer  these 
questions,  we  need  to  know  more  about  fixation  patterns  in  realistic  dynamic  environments  and 
about  how  fixation  and  eye  tracking  affect  the  perception  of  relative  depth. 


Acknowledgement. 

Dana  Ballard  first  got  me  thinking  about  the  role  of  tracking  in  visual  motion  analysis. 


References 

[1] R. Bajcsy, “Active perception vs. passive perception”, Proc. IEEE Workshop on Computer Vision, 55-59, 1985.

[2] J. Aloimonos, I. Weiss, and A. Bandyopadhyay, “Active vision”, Proc. First International Conference on Computer Vision, 35-54, 1987.

[3] D.H. Ballard, Eye Movements and Spatial Cognition, Technical Report 218, University of Rochester, 1987.

[4] W.B. Thompson, K.M. Mutch, and V.A. Berzins, “Dynamic occlusion analysis in optical flow fields”, IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-7:374-383, July 1985.

[5] B.K.P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.

[6] W.F. Clocksin, “Perception of surface slant and edge labels from optical flow: a computational approach”, Perception, 9:253-269, 1980.

[7] J.H. Rieger and D.T. Lawton, “Sensor motion and relative depth from difference fields of optic flows”, Proc. Eighth International Joint Conference on Artificial Intelligence, 1027-1031, 1983.

[8] E.C. Hildreth, The Measurement of Visual Motion, MIT Press, Cambridge, MA, 1983.


